RANSAC: estimating a model using very noisy data (part 2)

In our previous installment, we set the stage for our exploration of the RANSAC algorithm, illustrating how traditional linear regression can perform quite badly on a dataset with many suspect observations. In a nutshell, the estimation penalizes large prediction errors heavily, and as a result it can over-compensate for a few very large errors (so-called “outliers”), producing a poor model overall.

This issue is particularly prevalent with “standard” linear regression, because it evaluates goodness of fit using the square of the prediction errors, which mechanically penalizes larger errors more heavily. However, the problem applies beyond that specific case. Most estimation techniques proceed by minimizing some measure of prediction error and, because large errors are worse than small ones, will usually penalize them more heavily, introducing the same sensitivity to outliers.
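To make that concrete, here is a small illustration (my own, not from the post) of how squaring makes a single large error outweigh many small ones:

```fsharp
// Sum of squared errors, the criterion behind "least squares".
let sumOfSquares (errors: float list) =
    errors |> List.sumBy (fun e -> e * e)

// Ten observations each off by 1.0 cost less, under this criterion,
// than a single observation off by 10.0:
sumOfSquares (List.replicate 10 1.0) // 10.0
sumOfSquares [ 10.0 ]                // 100.0
```

An estimator minimizing this criterion will bend the model heavily toward that one bad point.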

The RANSAC algorithm attempts to address that issue by combining a few simple ideas. In pseudo-code, slightly modified from the Wikipedia version, the key part of the algorithm goes like this:

- Take a small random subset of the available data, hoping it contains mostly "clean" observations, so-called inliers,
- Estimate a prediction model on that small sample,
- Use that prediction model on the entire dataset. If the error on an observation is too large, discard it as an outlier; otherwise, keep it as a potential inlier,
- If the overall number of inliers is too small (i.e. we have too many outliers), the model is a bad fit: discard it,
- Otherwise, estimate a model on the inliers, and return the result.
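These steps translate fairly directly into F#. Below is my own sketch of a single RANSAC iteration, not the code from the repository; `fit`, `error`, and the parameter names are illustrative placeholders:

```fsharp
let rng = System.Random 42

// One RANSAC iteration: 'fit' estimates a model from observations,
// 'error' measures the prediction error for a single observation.
let ransacIteration
    (fit: 'obs [] -> 'model)
    (error: 'model -> 'obs -> float)
    (sampleSize: int)
    (threshold: float)
    (minInliers: int)
    (data: 'obs []) =
    // small random subset, hoping it contains mostly inliers
    let sample = Array.init sampleSize (fun _ -> data.[rng.Next data.Length])
    let candidate = fit sample
    // keep the observations that the candidate model predicts well
    let inliers =
        data |> Array.filter (fun obs -> error candidate obs <= threshold)
    if inliers.Length < minInliers
    then None // too many outliers: bad fit, discard the model
    else Some (fit inliers) // re-estimate on all the inliers
```

Running that iteration repeatedly and keeping the best accepted model gives the full algorithm.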
More...

RANSAC: estimating a model using very noisy data (part 1)

Recently, as part of a project I am working on, I had to estimate a model to make some predictions. And, as is usually the case, the data available was not very good, with a lot of suspect data points, so-called “outliers”. Estimating the parameters of a model becomes tricky then, because a few data points that are very wrong can have an outsized impact on the parameters, and tilt the results of the estimation far from the correct values.

This led me to the RANSAC algorithm, which is designed to handle that exact problem. And, in my experience, re-implementing an algorithm from scratch is a great way to really understand how it works, so that’s what I did: you can find the current code here.

In this post, I will motivate the problem first, demonstrating how outliers can wreck model estimation. In our next installment, we will go over RANSAC and see if it fares any better!

Let’s illustrate the problem with a simple example. Imagine that we have a model where we observe a value X and a value Y, and that these two follow a simple relationship where

Y = 1 + 0.5 * X

That is, they form a straight line. Now imagine that we have a dataset with a sample of observations for X and Y, with 2 caveats:

  • In half the cases, the observation is completely wrong, and there is no relationship between X and Y,
  • When the observation is correct, the observation is a bit off, and the recorded values are Y = 1 + 0.5 * X + some noise.

Let’s first create a model to represent that:

type Obs = { X: float }

type Parameters = {
    Constant: float
    Slope: float
    }

let predictor (parameters: Parameters) =
    fun (obs: Obs) ->
        parameters.Constant
        + parameters.Slope * obs.X

let trueParameters = {
    Constant = 1.0
    Slope = 0.5
    }

We have observations Obs and model Parameters. The “true” model has parameters { Constant = 1.0; Slope = 0.5 }, which we can use to predict the value Y like so:

predictor trueParameters { X = 1.0 }
val it: float = 1.5
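To simulate the dataset described earlier, one could write a generator along these lines (a sketch of mine; the noise magnitudes are arbitrary choices, not taken from the post):

```fsharp
let rng = System.Random 42

// Half the points are pure noise, the other half follow the true
// line Y = 1 + 0.5 * X, plus a small error in [-0.5, 0.5].
let sample =
    Array.init 100 (fun _ ->
        let x = 10.0 * rng.NextDouble()
        if rng.NextDouble() < 0.5
        then
            // outlier: Y has no relationship with X
            x, 20.0 * rng.NextDouble() - 10.0
        else
            // inlier: true model, plus a small noise term
            x, 1.0 + 0.5 * x + (rng.NextDouble() - 0.5))
```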
More...

How much of a burden is management?

From time to time, a small and usually unimportant question gets stuck in my head, and won’t stop nagging me until I spend the time to figure out the answer. This post is about one of these.

Last December, I mentioned in a post that “bigger teams require more coordination”, which would lead to diminishing returns to scale. As a team grows, it requires managers to coordinate activities. Grow the team further, and the managers themselves require coordination, which means another layer of managers, and so on and so forth. My intuition was that, ignoring other effects, larger teams would progressively become less and less productive, because of the increased burden of management.

The question that got stuck in my head was, how does this burden grow? What shape does it have, and can I express it as a mathematical function? This is the question I will be investigating in this post.

The way I approached the question started by formulating a simplistic model, without being too concerned about the finer details. In my world, any time we form a group of a certain size, a coordinator (a manager, if you will) is needed. These coordinators do not directly contribute to the actual work, but they are necessary for the workers to perform their work.

As an illustration, let’s imagine that this group size is 5. In this case,

  • 1 worker can work without coordination,
  • 4 workers can work without coordination,
  • 5 workers require 1 coordinator (we reached a group size of 5),
  • 6 workers require 1 coordinator (we have 1 group of 5, and an unsupervised worker),
  • 25 workers require 5 coordinators, who themselves require 1 coordinator.

… and so on and so forth: 125 workers (5*5*5) require another layer of coordinators, 625 yet another layer, etc…

So how does it look? Let’s write a function, computing the coordination overhead needed for a given number of workers:

let groupSize = 5

let overhead (workers: int) =
    workers
    |> Seq.unfold (fun population ->
        if population >= groupSize
        then
            let managers = population / groupSize
            Some (managers, managers)
        else None
        )
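Summing the successive layers the unfold produces gives the total coordination overhead. A quick check against the worked example above (the definitions are repeated so the snippet stands alone):

```fsharp
let groupSize = 5

let overhead (workers: int) =
    workers
    |> Seq.unfold (fun population ->
        if population >= groupSize
        then
            let managers = population / groupSize
            Some (managers, managers)
        else None
        )

overhead 4 |> Seq.sum     // 0: no coordination needed
overhead 6 |> Seq.sum     // 1: one group of 5, one unsupervised worker
overhead 25 |> Seq.toList // [5; 1]: 5 coordinators, plus their own coordinator
overhead 25 |> Seq.sum    // 6 coordinators in total
```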
More...

Adding Goal Seek to Quipu (and helping Santa with it!)

This post is part of the F# Advent 2025 series, which already has bangers! Check out the whole series, and a big shout out to Sergey Tihon for organizing this once again!

It is that merry time of the year again! The holidays are approaching, and in houses everywhere, people are happily sipping eggnog and hanging decorations. But in one house, the mood is not festive. Every year on December 1st, the first day of Advent, Santa Claus begins wrapping gifts for 2 billion children worldwide, from his workshop at the North Pole. But this year, Krampus unexpectedly decided to impose tariffs on Greenland, throwing the supply chain of gifts into chaos. It is now December 11th, and Santa just received the goods. Santa is now 11 days behind schedule, and needs to hire many, many more Elves than usual to catch up. But… how many Elves does he need to hire?

Santa runs a tight ship at Santa, Inc., and he knows that adding new Elves to the team won’t be seamless. Bigger teams require more coordination and additional equipment.

Based on available data, Santa knows that:

  • He needs to hire Elves to wrap 2,000,000,000 gifts,
  • A single Elf can wrap at most 100,000 gifts a day,
  • Instead of the normal 24 Advent days, Santa has only 13 days left,
  • A team of n Elves will only be able to produce n ^ 0.8 times as much as a single Elf, that is, there are diminishing returns to scale.

Note: the function f(elves) = elves ^ 0.8 is largely arbitrary. It has the shape we want for our problem: it is always increasing (more Elves can wrap more gifts), but the increase slows down gradually. For instance f(1) = 1.00, whereas f(2) = 1.74, meaning that 2 Elves will only be able to wrap 1.74 times as many gifts as a single Elf, instead of twice as many.
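Putting those numbers into code gives a small sketch of the setup (my own framing; the post solves the problem with Quipu's goal seek, rather than the closed-form inversion shown below):

```fsharp
// Santa's constraints, as listed above.
let giftsPerElfPerDay = 100_000.0
let daysLeft = 13.0
let target = 2_000_000_000.0

// Gifts a team of n Elves can wrap over the remaining days,
// with the diminishing-returns exponent of 0.8.
let production (elves: float) =
    giftsPerElfPerDay * daysLeft * (elves ** 0.8)

// This toy version happens to have a closed form, so we can
// invert it directly to find the break-even team size.
let elvesNeeded =
    (target / (giftsPerElfPerDay * daysLeft)) ** (1.0 / 0.8)
```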

Can we help Santa decide how many Elves to hire? And can we figure out how much the Krampus shenanigans are costing Santa, Inc.? We certainly can, and today we will do so using the goalSeek feature which we just added to Quipu.

More...

Count distinct items with the CVM algorithm

I came across this post on the fediverse the other day, pointing to an interesting article explaining the CVM algorithm. I found the algorithm very intriguing, and thought I would go over it in this post, and try to understand how it works by implementing it myself.

The CVM algorithm, named after its creators, is a procedure to count the number of distinct elements in a collection. In most situations, this is not a hard problem. For example, in F#, one could write something like this:

open System

let rng = Random 42
let data = Array.init 1_000_000 (fun _ -> rng.Next(0, 1000))
data
|> Array.distinct
|> Array.length

val it: int = 1000

We create an array filled with 1 million random numbers between 0 and 999, and directly extract the distinct values, which we then count. Easy peasy.

However, imagine that your data is so large that you can’t just load it in memory, and perhaps even the set of distinct items you are trying to count is too large to fit in memory. How would you go about counting the distinct items in your data then?

The CVM algorithm solves that problem. In this post, we will first write a direct, naive implementation of the algorithm as presented in the paper, and try to discuss why it works. Then we’ll test it out on the same example used in the article, counting the words used in Hamlet.
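As a preview, the core idea can be sketched in a few lines of F#. This is my own minimal version, not the post's implementation: keep a bounded buffer of items, each retained with probability p, and halve p whenever the buffer fills up.

```fsharp
// Minimal CVM sketch: the final estimate of the number of
// distinct items is (buffer size) / p.
let cvmEstimate (s: int) (stream: seq<'a>) =
    let rng = System.Random 42
    let buffer = System.Collections.Generic.HashSet<'a>()
    let mutable p = 1.0
    for item in stream do
        // each item's fate is decided by its most recent occurrence
        buffer.Remove item |> ignore
        if rng.NextDouble() < p then buffer.Add item |> ignore
        if buffer.Count >= s then
            // buffer full: keep each item with probability 1/2, halve p
            // (the paper also handles the unlikely case where nothing
            // gets dropped; omitted here)
            let kept =
                buffer
                |> Seq.toArray
                |> Array.filter (fun _ -> rng.NextDouble() < 0.5)
            buffer.Clear()
            kept |> Array.iter (fun x -> buffer.Add x |> ignore)
            p <- p / 2.0
    float buffer.Count / p
```

The buffer never holds more than s items, no matter how large the stream, which is exactly the memory guarantee we were after.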

More...