Attempting to auto-tune RANSAC

In my previous posts, I looked into the RANSAC algorithm. One thing I wondered about is, could the algorithm usage be made simpler by automatically tuning some of its input parameters? That is, rather than requiring the user to enter parameters, can we derive reasonable values for these parameters from the data? In this post, I will go over my first attempt. This was not a success, in that it did not produce immediately useful results, but it gave me a better understanding of the algorithm, and the process was interesting, so that’s what this post will focus on.

In the algorithm as described in Wikipedia, 2 parameters stand out in particular:

My thinking here was that, given a particular model, I should be able to derive these numbers from the data itself. t, the threshold value, is about how far predictions are from the correct value, something we can measure. d is about how many predictions should be close to their actual values overall. It is related to a different measure, the proportion of outliers. If I had, say, 20% of outliers in my sample, I would expect roughly 80% of my model predictions to be close to the correct value.

RANSAC assumes that you, the user, will input these values. What I wondered
about is, if all I had was a model and a dataset, how I would go about evaluating the value of d. Or, stated differently, could I estimate p(outliers), the proportion of outliers in a dataset?

Thought process

Let’s try, starting with clarifying some assumptions:

We can estimate a model using a very small random subset of the data. If we do so, 3 situations can occur:

Now if I knew p, I could compute the probability that a random sample of n observations is in Case 1, 2 or 3. Case 1 occurs if I pick inliners n times. Picking an inliner has a probability (1 - p) of occurring, so Case 1 has a probability of (1 - p) ^ n. Similarly Case 2 has a probability of p ^ n, and Case 2 is whatever is left.

Let’s illustrate on a concrete example, with p = 0.2 (20% outliers) and n = 3. In that case, the probability of Case 1 is 0.8 * 0.8 * 0.8 = 0.512: we picked a “good sample”, with no outlier, and the estimated model should be good. Conversely, with a probability of 1.0 - 0.512 = 0.488, we picked 1 or more outliers (Case 2 or 3), and the model we estimated should be a bad fit.

So far, so good. What can I do with that?

I could pick, say, 100 random samples of 3 data points (n = 3), and estimate 100 different model parameters. What I would expect then is that for about 51 of these samples I would have a good model (Case 1), and for 49 the estimated model would be contaminated by 1 or more outliers (Case 2 or 3). So, for about 51 samples, I would expect roughly 80% of the errors to be low, and 20% to be high (the outliers), whereas for the 49 “bad” samples I would expect mostly high errors.

Validating the thought process

Let’s try that out, on the same example we used in our previous posts. We generated a sample of 100 data points where:

let rng = System.Random 0
let probaNoisy = 0.5
let sample =
    Array.init 100 (fun _ ->
        let obs = { X = rng.NextDouble () }
        let y =
            if rng.NextDouble () < probaNoisy
            then 
                // outlier, random between 1.0 and 1.5
                1.0 + 0.5 * rng.NextDouble ()
            else
                // perfect model prediction
                predictor trueParameters obs
                // ... with added uniform noise, +/- 0.1
                + (0.2 * rng.NextDouble () - 0.1)
        {
            Observation = obs
            Label = y
        }
        )

We are interested in observing the prediction errors for different models. Because we are lazy, we will just tweak a bit the original code, like so:

let config: Configuration = {
    RNG = System.Random 0
    MinimumTrainingSampleSize = 2
    MinimumInlinersRequired = 70 // un-necessary here
    }

let evaluate
    (config: Configuration)
    (estimateParameters: Example<'Obs, 'Lbl> [] -> 'Param)
    (predictor: 'Param -> 'Obs -> 'Lbl)
    (data: Example<'Obs, 'Lbl> [])
    =
    // sample n random observations from data
    let maybeInliners =
        Array.init config.MinimumTrainingSampleSize (fun _ ->
            let index = config.RNG.Next data.Length
            data[index]
            )
    let maybeModel =
        maybeInliners
        |> estimateParameters
    data
    |> Array.map (fun ex ->
        predictor maybeModel ex.Observation - ex.Label
        |> abs
        )

We select 2 random examples from the data, estimate a model, and compute the error across the entire dataset.

All that’s left to do then is estimate 100 different models, each using a different random sample of 2 observations, and plot the errors. We will use the median error as a measurement of the overall error (the value such that half the errors are greater, and half are smaller), and sort the models by increasing median error:

open MathNet.Numerics.Statistics

// estimate 100 models and their error
let eval =
    Array.init 100 (fun _ ->
        sample
        |> evaluate config estimator predictor
        |> fun errors -> errors.Percentile(50)
        )
    |> Array.sort

eval
|> Array.mapi (fun i x -> i, x)
|> Chart.Line
|> Chart.show

The result is… not what I was hoping for:

Median prediction errors across 100 models, sorted in increasing order

What I was expecting was:

Instead, what we see is a gradual increase in errors for about 75% of the models, followed by a very steep increase for the last quarter. Stated differently, we don’t see a clear boundary between the clean models and the ones that are contaminated with outliers.

Is this a failure?

This is clearly not what I expected. I expected a clear cliff after the first 25 models, there is no such cliff. So yes, this is a failure.

This took me aback. After the usual cycle of denial, anger, bargaining and depression, I reached acceptance, and started thinking again. Why is it not working? After all, the thought process was fairly reasonable. Where is the flaw, and can we fix this?

After some thinking, my hypothesis was that there was simply too much noise in the data to cleanly distinguish between Case 1 and Case 2. The dataset is noisy in 2 different ways:

To confirm that this could be the source of the issue, I re-created the dataset, with a much smaller error band of + / - 0.01 around clean observations, like so:

    if rng.NextDouble () < probaNoisy
    then 
        // outlier generation unchanged
    else
        // perfect model prediction
        predictor trueParameters obs
        // ... with added uniform noise, +/- 0.01
        + (0.02 * rng.NextDouble () - 0.01)

I also had a hunch that the scale of the chart was potentially hiding meaningful differences, because the scale of large errors completely dwarfs the scale of the smaller errors, so I switched to a log scale, which produces the following chart:

Log of median prediction errors across 100 models, sorted in increasing order

Lo and behold, this is much better! We see a very clear rise around the 31st observation, roughly where we would expect it, somewhat close to 25.

Note: for the sake of completeness, I also tried to transform the original chart using a log scale. It still doesn’t show anything meaningful happening around the 25th observation. In other words, the failure is confirmed.

So where does this leave us? First, it was satisfying to see that my original idea was not entirely wrong. Then, it gave me a better understanding of the potential impact of errors. For RANSAC to work well, the outliers should be obvious outliers. The sample I used initially was particularly bad, because it contained many, many outliers, which weren’t all that far off from clean observations.

Is this useful? Maybe. Visually, I can pick up that around the 31st observation the errors degrade rapidly. From that I can deduce that 31% of the samples were probably purely “clean” observations, that is, Case 1. I can then infer that we probably have around 38% of outliers ((1 - p) * (1 - p) ~ 0.39, i.e. p ~ 0.38). It is not the best approximation, but it is not too bad either!

The issue, though, is that while this approach gives us a visual method to calibrate RANSAC, it does not lend itself well to full automation. I replaced the original problem (finding a decent value for d) by another problem, which is not trivial: detecting a cliff in a curve, assuming there is one.

I’ll leave it at that for now. I have another direction I want to try out, but it needs to simmer for a bit first. In the meantime, I hope you found something of interest in this post!

Do you have a comment or a question?
Ping me on Mastodon!