Maximum Likelihood estimation with Quipu, part 2

In my previous post, I went over fitting the parameters of a Log-Normal distribution to a sample of observations, using Maximum Likelihood Estimation (MLE) and Quipu, my Nelder-Mead solver. MLE was overkill for the example I used, but today I want to illustrate some more interesting things you could do with MLE, building up from the same base setup.

Let’s do a quick recap first. I will be using the following libraries:

#r "nuget: MathNet.Numerics, 5.0.0"
#r "nuget: MathNet.Numerics.FSharp, 5.0.0"
#r "nuget: Plotly.NET, 5.0.0"
#r "nuget: Quipu, 0.5.2"

Our starting point is a sample of 100 independent observations, generated by a Log-Normal distribution with parameters Mu=1.3 and Sigma=0.3 (which describe the shape of the distribution), like so:

open MathNet.Numerics.Random
open MathNet.Numerics.Distributions

let mu, sigma = 1.3, 0.3
let rng = MersenneTwister 42
let duration = LogNormal(mu, sigma, rng)

let sample =
    duration.Samples()
    |> Seq.take 100
    |> Array.ofSeq

[Figure: Log-Normal distribution density and histogram of the sample]

If we want to find a distribution that fits the data, we need a way to compare how well two candidate distributions fit it. The likelihood function does just that: it measures how likely it is that a particular distribution could have generated the sample - the higher the number, the higher the likelihood:

// log-likelihood: sum of the log of the density of each observation
let logLikelihood sample distributionDensity =
    sample
    |> Array.sumBy (fun observation ->
        observation
        |> distributionDensity
        |> log
        )
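
As a quick illustration, we can evaluate how well one specific candidate distribution explains our sample (the parameter values here are purely illustrative):

// log-likelihood of the sample under a candidate LogNormal(1.0, 0.5)
let candidate = LogNormal(1.0, 0.5)
let candidateDensity = candidate.Density
logLikelihood sample candidateDensity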

There are many distributions we could try to fit to a sample. If we assume that a Log-Normal distribution is what generated our observations, the problem becomes simpler: out of all the parameters Mu and Sigma that correspond to a valid Log-Normal distribution, which pair is the most likely to have produced what we observe?

We can then make our log-likelihood function more specific, and use Quipu to find the parameters that maximize that function:

open Quipu

let logLikelihood sample (mu, sigma) =
    // a negative sigma does not define a valid Log-Normal:
    // return the worst possible log-likelihood
    if sigma < 0.0
    then - infinity
    else
        let distribution = LogNormal(mu, sigma)
        let density = distribution.Density
        sample
        |> Array.sumBy (fun t ->
            t
            |> density
            |> log
            )

logLikelihood sample
|> NelderMead.objective
|> NelderMead.maximize

… which happens to return a solution of (1.31, 0.29), pretty close to the values we used to generate that sample, (1.3, 0.3).

Is this overkill?

I mentioned earlier that using MLE for this particular example was overkill. A Log-Normal distribution, as the name suggests, has the following property: if X follows a Log-Normal distribution, then Ln(X) follows a Normal distribution. The mean Mu and standard deviation Sigma of a Normal distribution can be estimated in a direct fashion from a sample, so we can get estimates for our Log-Normal by transforming the sample into logarithms, and using that transformed sample instead:

// convert to log
let logSample = sample |> Array.map log
// mu is the sample average
let muEstimate = logSample |> Array.average
// sigma is the square root of the sample variance
let sigmaEstimate =
    let sampleSize = logSample |> Array.length
    logSample
    |> Array.sumBy (fun x ->
        pown (x - muEstimate) 2 / (float (sampleSize - 1))
        )
    |> sqrt
printfn $"%.2f{muEstimate}, %.2f{sigmaEstimate}"
1.32, 0.30
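
Incidentally, MathNet.Numerics ships a built-in estimator for the Log-Normal, which we will use again later in this post; presumably it performs a similar direct computation under the hood:

// built-in estimation from MathNet.Numerics
let estimated = sample |> LogNormal.Estimate
printfn $"%.2f{estimated.Mu}, %.2f{estimated.Sigma}"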

This is both simpler and faster than using Quipu and MLE. So… why use MLE?

First, while we can easily estimate the parameters of a Log-Normal, this is not the case for other distributions in general. The Log-Normal is a special case, where a shortcut exists. If we thought the observations were generated by something that was not a Log-Normal, say, a Gamma or a Weibull, the MLE approach would still work. All we would need is an expression for the likelihood of an observation, that is, the density of the distribution. As an added benefit, we would also immediately get an answer to the question “is the data I am looking at more likely to be generated by a Log-Normal, or by this other distribution?” just by comparing their log likelihoods.
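
As a rough sketch of what such a comparison could look like, using MathNet's Gamma distribution (the parameter values below are illustrative, not fitted):

// log-likelihood of the sample under each candidate distribution
let underLogNormal =
    let density = LogNormal(1.31, 0.29).Density
    sample |> Array.sumBy (density >> log)
let underGamma =
    let density = Gamma(10.0, 2.5).Density
    sample |> Array.sumBy (density >> log)
// the candidate with the higher log-likelihood fits the sample better;
// here, we would expect underLogNormal to come out ahead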

In the next section, I will follow this train of thought further, and explore what other types of questions you could ask of your data using Maximum Likelihood Estimation.

Censored observations

So far, we looked into the straightforward estimation case, where we have a nice, clean sample of observations. Now let’s look into more interesting things you could do with Maximum Likelihood Estimation.

Imagine for a minute that instead of observing all the data, we had only partial observations. As an example, let’s imagine that we are observing things through a device / sensor that can only record values above a certain level.

Building on our original example, let’s say we have a limit: if the true value is below that limit, we can’t observe what the actual value was, only that there was an observation.

For instance, if that limit was 4, instead of the original sample, we would now have the following partial sample, where None denotes the case where we know there was a value, but we don’t have a record of it because it was below the limit:

let limit = 4.0

let partial =
    sample
    |> Array.map (fun obs ->
        if obs < limit
        then None
        else Some obs
        )

Instead of 100 observations, we only have 42 actual values to work with:

partial
|> Array.choose id
|> Array.length

So how could we estimate the parameters of the underlying Log-Normal?

The naive way to do this would be to ignore the fact that our dataset is censored, and work off just the clean observations, like so:

// don't do this!
partial
|> Array.choose id
|> LogNormal.Estimate

This runs just fine, and estimates the parameters to be Mu=1.60 and Sigma=0.16, pretty far off from the correct values. It is incorrect, because we are treating the sample as if it had been generated by a Log-Normal, but it hasn’t: it was generated by a Log-Normal, and then censored. Treating it as a regular Log-Normal sample has no reason to produce the correct estimates.

So what can we do? Maximum Likelihood hinges on having a function that can calculate the likelihood of any observation, so let’s see if we can write one. We have 2 cases to consider here:

If the true observation was below the limit of 4, we observe None. The likelihood of observing None is therefore the probability that the Log-Normal produces a value less than 4, which we can get using the so-called cumulative distribution function.

If the true observation was above the limit, we observe Some value. The likelihood of observing value is, as before, given by the density function.

Let’s modify our likelihood function to represent that:

let logLikelihood sample (mu, sigma) =
    if sigma < 0.0
    then - infinity
    else
        let distribution = LogNormal(mu, sigma)
        let density = distribution.Density
        let cumulative = distribution.CumulativeDistribution
        let likelihood obs =
            match obs with
            // censored: probability of a value below the limit
            | None -> cumulative limit
            // fully observed: density at the observed value
            | Some value -> density value
        sample
        |> Seq.sumBy (likelihood >> log)

This function gives us the log-likelihood of the full partial sample for any pair of parameters, so we can re-use the same approach as before, using Quipu to maximize the log-likelihood:

partial
|> logLikelihood
|> NelderMead.objective
|> NelderMead.maximize

Quipu estimates Mu=1.33 and Sigma=0.29. It is not as good as what we got using the full sample, which is not a surprise, given that we have only 42 clean observations instead of 100 initially. Still, this is pretty close to the true values of 1.3 and 0.3, and definitely much better than what the “incorrect” estimation gave us!

You could use the exact same technique if the observations were recorded only up to a certain limit. In that case, the likelihood of a censored observation is the probability of a value being greater than the limit, which is 1 - cumulative(limit).
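
Here is a minimal sketch of what that could look like, assuming a hypothetical upperLimit above which values are not recorded:

// hypothetical right-censored version: values above upperLimit
// are recorded as None
let upperLimit = 10.0 // illustrative value
let logLikelihoodRightCensored sample (mu, sigma) =
    if sigma < 0.0
    then - infinity
    else
        let distribution = LogNormal(mu, sigma)
        let density = distribution.Density
        let cumulative = distribution.CumulativeDistribution
        let likelihood obs =
            match obs with
            // censored: probability of a value above the limit
            | None -> 1.0 - cumulative upperLimit
            // fully observed: density at the observed value
            | Some value -> density value
        sample
        |> Seq.sumBy (likelihood >> log)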

Unknown limit

In the previous case, we assumed that we knew what the limit was. What if we didn’t? Again, we can use Maximum Likelihood; we just need to specify a function that computes the likelihood of each observation, given the model we assume.

Here, the transformation is as simple as turning the limit into a parameter of the model:

let logLikelihood sample (limit, mu, sigma) =
    if sigma < 0.0
    then - infinity
    else
        let distribution = LogNormal(mu, sigma)
        let density = distribution.Density
        let cumulative = distribution.CumulativeDistribution
        let likelihood obs =
            match obs with
            | None -> cumulative limit
            | Some value ->
                // an observed value below the limit is impossible
                // under the model: its likelihood is zero
                if value < limit
                then 0.0
                else density value
        sample
        |> Seq.sumBy (likelihood >> log)

Instead of assuming that we know the value of the limit, we turn it into a third argument to the likelihood function. In essence, what we are now saying is “how likely is it that the data was generated using a Log-Normal with parameters Mu and Sigma, AND this particular value of limit?”

Most of the function remains the same, with one small change here:

| Some value ->
    if value < limit
    then 0.0
    else density value

We cannot have an observation below the limit. Therefore, we give such an observation a likelihood of zero, which becomes log 0.0 = -infinity once we take the logarithm. This will force the search to avoid impossible values for limit.

And we are done! All we need to do next is search for a set of 3 values that maximize the likelihood, like so:

partial
|> logLikelihood
|> NelderMead.objective
|> NelderMead.startFrom (Start.around([ 0.0; 0.0; 1.0 ], 0.1))
|> NelderMead.maximize

Quipu proposes the following solution: limit=4.01, Mu=1.31 and Sigma=0.34, which is once again fairly close to the correct values.

I had to make one small adjustment to the Quipu setup to make this work. After setting the objective function, I direct Quipu to start the search around limit=0, Mu=0 and Sigma=1, within 0.1 of these 3 values.

The reasons are a bit technical. By default, Quipu will start its search around 0, or, if we are searching for 3 parameters, around 0, 0, 0. In most cases, starting around 0 works just fine, but this depends on the objective function. In our particular situation, it doesn’t, and causes the solver to terminate without finding a solution.

This was also a reminder that I need to revisit how Quipu surfaces solver issues to the user.

I haven’t worked out the details of what goes wrong exactly, but I believe the issue goes something like this. The default initial candidates for 3 parameters (the so-called simplex) end up being these 4 candidates (see this file for details):

0.577, -0.577, -0.577
-0.577, 0.577, -0.577
-0.577, -0.577, 0.577
0.577, 0.577, 0.577

Now out of these 4 candidates, 2 already have “bad” starting values for Sigma (which should always be positive), at -0.577. This might not be the only problem, but we are certainly not giving the solver an easy starting point, as the quick check below illustrates. So rather than using the defaults, we supply the solver with a starting point where none of the initial candidates is problematic, which addresses the issue.
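
As a quick sanity check (a sketch, re-using the logLikelihood function and the partial sample from above), we can evaluate the objective at the default candidates, where Sigma is the third coordinate:

// evaluating the objective at the default initial candidates
[
    0.577, -0.577, -0.577
    -0.577, 0.577, -0.577
    -0.577, -0.577, 0.577
    0.577, 0.577, 0.577
]
|> List.map (logLikelihood partial)
// candidates with a negative Sigma (or a negative limit) evaluate
// to -infinity: a difficult starting simplex for the solver

And… we are done.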

Parting thoughts

I remember being introduced to Maximum Likelihood Estimation while I was still a student, and feeling very intimidated. As a result, I avoided the subject for a long time. I wish I hadn’t! I really enjoy how explicit the approach is: state what you believe the probability model to be, and find the parameters that are the most likely to have produced the observations you have.

My goal with this post was to illustrate how powerful and versatile the technique is. We started with a simple problem, estimating the parameters of a Log-Normal distribution on a sample. In this particular case, Maximum Likelihood Estimation was more complex than necessary. However, we were able to re-use essentially the same scaffold, with minor modifications, to tackle much more complicated cases like censored data. As long as our model involves distributions whose density or cumulative distribution can be evaluated, we can fit a model.

Anyways, this is where I will leave it today!

Do you have a comment or a question?
Ping me on Mastodon!