Attempting to auto-tune RANSAC

In my previous posts, I looked into the RANSAC algorithm. One thing I wondered about is, could the algorithm usage be made simpler by automatically tuning some of its input parameters? That is, rather than requiring the user to enter parameters, can we derive reasonable values for these parameters from the data? In this post, I will go over my first attempt. This was not a success, in that it did not produce immediately useful results, but it gave me a better understanding of the algorithm, and the process was interesting, so that’s what this post will focus on.

In the algorithm as described in Wikipedia, 2 parameters stand out in particular:

  • t, “A threshold value to determine data points that are fit well by the model (inlier)”,
  • d, “The number of close data points (inliers) required to assert that the model fits well to the data”.

My thinking here was that, given a particular model, I should be able to derive these numbers from the data itself. t, the threshold value, is about how far predictions are from the correct value, something we can measure. d is about how many predictions should be close to their actual values overall. It is related to a different measure, the proportion of outliers. If I had, say, 20% of outliers in my sample, I would expect roughly 80% of my model predictions to be close to the correct value.

RANSAC assumes that you, the user, will input these values. What I wondered
about is, if all I had was a model and a dataset, how I would go about evaluating the value of d. Or, stated differently, could I estimate p(outliers), the proportion of outliers in a dataset?
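As a concrete sketch of that idea (the function name, threshold value, and data below are illustrative, not taken from a real implementation): given a fitted model and a threshold t, the proportion of outliers can be estimated as the share of observations whose prediction error exceeds t.

```fsharp
// Hypothetical helper: count observations whose error exceeds the threshold.
let estimateOutlierProportion (threshold: float) (predict: float -> float) (data: (float * float) list) =
    let outliers =
        data |> List.filter (fun (x, y) -> abs (predict x - y) > threshold)
    float outliers.Length / float data.Length

// 4 observations on the line Y = 1 + 0.5 * X, plus 1 obviously wrong point.
let observations = [ (0.0, 1.0); (1.0, 1.5); (2.0, 2.0); (3.0, 2.5); (4.0, 10.0) ]
let proportion = estimateOutlierProportion 0.5 (fun x -> 1.0 + 0.5 * x) observations
// proportion = 0.2: an estimated 20% of outliers
```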

More...

RANSAC: estimating a model using very noisy data (part 2)

In our previous installment, we set the stage for our exploration of the RANSAC algorithm, illustrating how traditional linear regression could perform quite badly on a dataset with many suspect observations. In a nutshell, the estimation penalizes large prediction errors more heavily, and as a result, it can overcompensate for a few very large errors (so-called “outliers”), producing a poor model overall.

This issue is particularly prevalent with “standard” linear regression, because it evaluates goodness of fit using the square of the prediction errors, which mechanically penalizes larger errors more heavily. However, it applies beyond that specific case. Most estimation techniques proceed by minimizing some measurement of prediction error, and, as large errors are worse than small ones, will usually penalize large errors more heavily, introducing the same sensitivity to outliers.
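A quick way to see the effect, with made-up numbers: for a constant-only model, least squares picks the mean, which a single wild observation drags far off, while the median (which minimizes absolute error instead) barely moves.

```fsharp
// Illustrative data: 5 well-behaved observations, then the same set plus 1 outlier.
let clean = [ 1.0; 1.1; 0.9; 1.0; 1.05 ]
let withOutlier = clean @ [ 100.0 ]

let mean xs = List.average xs
let median xs = List.sort xs |> List.item (List.length xs / 2)

let meanShift = mean withOutlier - mean clean       // ≈ 16.5: the mean explodes
let medianShift = median withOutlier - median clean // 0.05: the median barely moves
```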

The RANSAC algorithm attempts to address that issue, using a few simple ideas together. In pseudo-code, slightly modified from the Wikipedia version, the key part of the algorithm goes like this:

- Take a small random subset of the data available, hoping it contains mostly "clean" observations, so-called inliers,
- Estimate a prediction model on that small sample,
- Use that prediction model on the entire data. If the error on an observation is too large, discard it as an outlier, otherwise keep the observation as a potential inlier,
- If the overall number of inliers is too small (i.e. we have too many outliers), this model is a bad fit, discard it,
- Otherwise estimate a model on the inliers, and return the result.
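As a rough sketch of those steps for a line model (not the actual implementation from the repository; the threshold and minimum inlier count are parameters the caller would supply), one iteration could look like this:

```fsharp
let rnd = System.Random(0)

// Fit a line through 2 sample points: the smallest sample that determines a line.
let fitLine ((x1, y1): float * float) ((x2, y2): float * float) =
    let slope = (y2 - y1) / (x2 - x1)
    let constant = y1 - slope * x1
    fun x -> constant + slope * x

let ransacIteration (threshold: float) (minInliers: int) (data: (float * float) []) =
    // 1. Take a small random subset: here, 2 distinct indices.
    let i, j = rnd.Next(data.Length), rnd.Next(data.Length)
    if i = j then None // degenerate sample, skip this iteration
    else
        // 2. Estimate a model on that small sample.
        let model = fitLine data.[i] data.[j]
        // 3. Keep observations with a small enough prediction error as potential inliers.
        let inliers = data |> Array.filter (fun (x, y) -> abs (model x - y) <= threshold)
        // 4. Too few inliers: bad fit, discard. Otherwise return the inliers
        // (a full implementation would re-estimate a model on them here).
        if inliers.Length < minInliers then None else Some inliers
```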
More...

RANSAC: estimating a model using very noisy data (part 1)

Recently, as part of a project I am working on, I had to estimate a model to make some predictions. And, as is usually the case, the data available was not very good, with a lot of suspect data points, so-called “outliers”. Estimating the parameters of a model becomes tricky then, because a few data points that are very wrong can have an outsized impact on the parameters, and tilt the results of the estimation far off the correct values.

This led me to the RANSAC algorithm, which is designed to handle that exact problem. And, in my experience, re-implementing an algorithm from scratch is a great way to really understand how it works, so that’s what I did: you can find the current code here.

In this post, I will motivate the problem first, demonstrating how outliers can wreck model estimation. In our next installment, we will go over RANSAC and see if it fares any better!

Let’s illustrate the problem with a simple example. Imagine that we have a model where we observe a value X and a value Y, and that these 2 follow a simple relationship where

Y = 1 + 0.5 * X

That is, they form a straight line. Now imagine that we have a dataset with a sample of observations for X and Y, with 2 caveats:

  • In half the cases, the observation is completely wrong, and there is no relationship between X and Y,
  • When the observation is genuine, it is still a bit off, and the recorded values are Y = 1 + 0.5 * X + some noise.

Let’s first create a model to represent that:

type Obs = { X: float }

type Parameters = {
    Constant: float
    Slope: float
    }

let predictor (parameters: Parameters) =
    fun (obs: Obs) ->
        parameters.Constant
        + parameters.Slope * obs.X

let trueParameters = {
    Constant = 1.0
    Slope = 0.5
    }

We have observations Obs and model Parameters. The “true” model has parameters { Constant = 1.0; Slope = 0.5 }, which we can use to predict the value Y like so:

predictor trueParameters { X = 1.0 }
val it: float = 1.5
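For reference, a dataset matching the 2 caveats above could be generated along these lines (the noise ranges and sample size here are illustrative; the post's actual generation code is not shown in this excerpt):

```fsharp
let rng = System.Random(42)

let noisySample =
    [ for _ in 1 .. 100 ->
        let x = rng.NextDouble() * 10.0
        if rng.NextDouble() < 0.5 then
            // Outlier: Y is unrelated to X.
            x, rng.NextDouble() * 10.0
        else
            // Genuine observation: Y follows the true model, plus a small error.
            x, 1.0 + 0.5 * x + (rng.NextDouble() - 0.5) ]
```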
More...

How much of a burden is management?

From time to time, a small and usually unimportant question gets stuck in my head, and won’t stop nagging me until I spend the time to figure out the answer. This post is about one of these.

Last December, I mentioned in a post that “bigger teams require more coordination”, which would lead to diminishing returns to scale. As a team grows, it requires managers to coordinate activities. Grow the team more, and managers themselves require coordination, and another layer of managers, and so on and so forth. My intuition was that, ignoring other effects, larger teams would progressively become less and less productive, because of the increased burden of management.

The question that got stuck in my head was, how does this burden grow? What shape does it have, and can I express it as a mathematical function? This is the question I will be investigating in this post.

The way I approached the question started by formulating a simplistic model, without being too concerned about the finer details. In my world, any time we form a group of a certain size, a coordinator (a manager, if you will) is needed. These coordinators do not directly contribute to the actual work, but they are necessary for the workers to perform their work.

As an illustration, let’s imagine that this group size is 5. In this case,

  • 1 worker can work without coordination,
  • 4 workers can work without coordination,
  • 5 workers require 1 coordinator (we reached a group size of 5),
  • 6 workers require 1 coordinator (we have 1 group of 5, and an unsupervised worker),
  • 25 workers require 5 coordinators, who themselves require 1 coordinator.

… and so on and so forth: 125 workers (5*5*5) require another layer of coordinators, 625 yet another layer, etc…

So how does it look? Let’s write a function, computing the coordination overhead needed for a given number of workers:

let groupSize = 5

let overhead (workers: int) =
    workers
    |> Seq.unfold (fun population ->
        if population >= groupSize
        then
            let managers = population / groupSize
            Some (managers, managers)
        else None
        )
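The function yields the number of coordinators per layer; summing the layers gives the total overhead. Checking it against the worked example above (the definition is repeated so the snippet runs standalone):

```fsharp
let groupSize = 5

// Coordinators per layer: each layer of managers itself needs managing,
// until a layer is smaller than the group size.
let overhead (workers: int) =
    workers
    |> Seq.unfold (fun population ->
        if population >= groupSize then
            let managers = population / groupSize
            Some (managers, managers)
        else None)

let layers = overhead 25 |> List.ofSeq // [5; 1]: 5 managers, plus 1 managing them
let total = overhead 25 |> Seq.sum     // 6 coordinators in total
```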
More...

Adding Goal Seek to Quipu (and helping Santa with it!)

This post is part of the F# Advent 2025 series, which already has bangers! Check out the whole series, and a big shout out to Sergey Tihon for organizing this once again!

It is that merry time of the year again! The holidays are approaching, and in houses everywhere, people are happily sipping eggnog and hanging decorations. But in one house, the mood is not festive. Every year on December 1st, the first day of Advent, Santa Claus begins wrapping gifts for 2 billion children worldwide, from his workshop at the North Pole. But this year, Krampus unexpectedly decided to impose tariffs on Greenland, throwing the supply chain of gifts into chaos. It is now December 11th, and Santa just received the goods. Santa is now 11 days behind schedule, and needs to hire many, many more Elves than usual to catch up. But… how many Elves does he need to hire?

Santa runs a tight ship at Santa, Inc., and he knows that adding new Elves to the team won’t be seamless. Bigger teams require more coordination and additional equipment.

Based on available data, Santa knows that:

  • He needs to hire Elves to wrap 2,000,000,000 gifts,
  • A single Elf can wrap at most 100,000 gifts a day,
  • Instead of the normal 24 Advent days, Santa has only 13 days left,
  • A team of n Elves will only be able to produce n ^ 0.8 times as much as a single Elf, that is, there are diminishing returns to scale.

Note: the function f(elves) = elves ^ 0.8 is largely arbitrary. It has the shape we want for our problem: it is always increasing (more Elves can wrap more gifts), but the increase slows down gradually. For instance f(1)=1.00, whereas f(2)=1.74, meaning that 2 Elves will only be able to wrap 1.74 times as many gifts as 1 Elf, instead of twice as many.
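Those numbers can be checked directly with F#'s `**` operator:

```fsharp
// Diminishing returns: a team of n Elves produces n ** 0.8 times
// the output of a single Elf.
let teamOutput (elves: float) = elves ** 0.8

let one = teamOutput 1.0 // 1.0
let two = teamOutput 2.0 // ≈ 1.74: 2 Elves wrap only 1.74 times as much as 1
```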

Can we help Santa decide how many Elves to hire? And can we figure out how much the Krampus shenanigans are costing Santa, Inc.? We certainly can, and today we will do so using the goalSeek feature which we just added to Quipu.
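Before reaching for Quipu (whose goalSeek API is not shown in this excerpt), here is a back-of-the-envelope sketch of the problem under the assumptions above: find the team size whose output over the 13 remaining days reaches 2 billion gifts, solved by plain bisection.

```fsharp
let gifts = 2_000_000_000.0
let giftsPerElfPerDay = 100_000.0
let daysLeft = 13.0

// Gifts a team of n Elves wraps over the remaining days, with diminishing returns.
let totalWrapped (elves: float) = daysLeft * giftsPerElfPerDay * (elves ** 0.8)

// Bisection: shrink the bracket around the target until it is under 1 Elf wide.
let rec bisect lo hi =
    let mid = (lo + hi) / 2.0
    if hi - lo < 1.0 then mid
    elif totalWrapped mid < gifts then bisect mid hi
    else bisect lo mid

let elvesNeeded = bisect 1.0 1_000_000.0
// elvesNeeded lands somewhere around 9,600 Elves
```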

More...