More SIMD vectors exploration

Since my earlier post looking into SIMD vectors in .NET, I attempted a few more experiments, trying to understand better where they might be a good fit, and where they would not.

The short version: at this point, my sense is that SIMD vectors can be very handy for some specific scenarios, but would require quite a bit of work to be usable in the way I was hoping to use them. This is by no means meant as a negative on SIMD; rather, it reflects my realization that a SIMD vector is quite different from a mathematical vector.

All that follows should also be taken with a big grain of salt. SIMD is entirely new to me, and I might not be using it right. While on that topic, I also want to thank @xoofx for his very helpful comments - much appreciated!

Anyways, with these caveats out of the way, let’s dive into it.

My premise approaching SIMD was along these lines: “I write a lot of code that involves vectors. Surely, the Vector class should be a good fit to speed up some of that code”.

To explore that idea, I attempted to write a few classic vector operations I often need, both in plain old F# and using SIMD vectors, trying to benchmark how much performance I would gain. In this post, I’ll share some of the results, and what I learnt in the process.

You can find the whole code here on GitHub.

Context: what am I trying to do here?

First, what do I mean by “code that involves vectors”? Most of my work involves numerical algorithms, be it machine learning, optimization, or simulation. These can often be represented in a general manner as operations on mathematical vectors, involving a few core operations that appear over and over again.

For instance, I have written code to compute the Euclidean distance multiple times in different projects, something along these lines (ignoring details such as what happens if v1 and v2 do not have matching lengths):

let distance (v1: float[], v2: float[]) =
    (v1, v2)
    ||> Array.map2 (fun x y -> pown (x - y) 2)
    |> Array.sum
    |> sqrt

We take 2 arrays of numbers, compute the difference between each of their elements, square it, sum it all together, and return the square root. By using arrays, we get a very general, re-usable function: the same operation works regardless of the array length.

From a mathematical standpoint, we are using arrays of floats here as a representation of a vector. The same operation, viewed from a mathematician's standpoint, could be written along these lines:

$ (v_1, v_2) \rightarrow { \sqrt { (v_1 - v_2) \cdot (v_1 - v_2) } } $

My point here is that while the F# example above is fairly general, it is still a very manual implementation of 2 algebra operations:

- the difference between 2 vectors, $ v_1 - v_2 $,
- the dot product of that difference with itself, $ (v_1 - v_2) \cdot (v_1 - v_2) $.

These 2 operations are not immediately obvious in the code - it would be nice to write the distance as an operation on vectors, using these “core” vector operators (difference and dot-product) instead, something along these lines:

let distance (v1: float[], v2: float[]) =
    let V1 = vector v1
    let V2 = vector v2
    (V1 - V2) .* (V1 - V2)
    |> sqrt

This was what motivated me to look into the SIMD-accelerated Vector class in .NET. I was hoping for a way to both accelerate my code using SIMD, and write code directly using algebra operators.

An important constraint here is that I want library code, code that I can plug into existing code to leverage vectors when appropriate. In that frame, I can’t impose SIMD Vector on the consumer. I want to be able to seamlessly pass in arrays, and return arrays. The SIMD Vector part should be transparent to the consumer.

Fixed size Vectors

My first attempt at this was the exact example I showed before, a distance function. With an assist from @xoofx, this is where I landed:

open System
open System.Numerics
open System.Runtime.InteropServices

let distance (v1: float[], v2: float[]) =

    // reinterpret each array as a span of SIMD-sized blocks
    let s1 = MemoryMarshal.Cast<float, Vector<float>>(ReadOnlySpan(v1))
    let s2 = MemoryMarshal.Cast<float, Vector<float>>(ReadOnlySpan(v2))

    let mutable total = 0.0
    for i in 0 .. (s1.Length - 1) do
        let v1 = s1.[i]
        let v2 = s2.[i]
        let diff = v1 - v2
        total <- total + Vector.Dot(diff, diff)
    sqrt total
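
Before benchmarking, a quick sanity check, using 2 made-up arrays of size 4 - exactly one block, assuming Vector<float>.Count is 4:

// each pairwise difference is 4, so the distance is sqrt (4 * 4^2) = 8
distance ([| 1.0; 2.0; 3.0; 4.0 |], [| 5.0; 6.0; 7.0; 8.0 |])
// expected: 8.0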

The good news first: this is vastly faster than the original version. For vectors of size 8, it takes ~2% of the time of the naive F# implementation:

| Method           | Mean       | Error     | StdDev    | Ratio | RatioSD |
|----------------- |-----------:|----------:|----------:|------:|--------:|
| classic          | 212.753 ns | 3.3103 ns | 2.9345 ns |  1.00 |    0.02 |
| simdV3           |   3.343 ns | 0.0967 ns | 0.0950 ns |  0.02 |    0.00 |

With vectors of size 4,000, we get even better results, running under 1% of the original version:

| Method           | Mean         | Error       | StdDev      | Ratio | RatioSD |
|----------------- |-------------:|------------:|------------:|------:|--------:|
| classic          | 115,976.9 ns | 2,315.62 ns | 2,166.03 ns | 1.000 |    0.03 |
| simdV3           |     967.4 ns |    13.64 ns |    11.39 ns | 0.008 |    0.00 |

Needless to say, at that point, my interest was piqued - who wouldn’t want a 98% speed improvement in their code?

However, some issues are already showing up. First, a Vector<float> is quite different from a mathematical vector: it has a fixed size, which is architecture dependent.
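
Vector.IsHardwareAccelerated and Vector<float>.Count make it easy to check what this looks like on a given machine:

open System.Numerics

let accelerated = Vector.IsHardwareAccelerated
let blockSize = Vector<float>.Count
printfn $"Accelerated: {accelerated}, block size: {blockSize}"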

On my machine, a Vector<float> has a size of 4. This has 2 implications:

First, I need to re-write each vector operation as an equivalent operation that processes the arrays in small blocks of 4 values, and then aggregates the partial results back together. This requires making sure the operation is associative. As an example, in the distance function, I kept the square root calculation “outside”, because $ \sqrt { x + y } \neq \sqrt { x } + \sqrt { y } $.

Even if the operation is associative mathematically, it might not be from a numeric computation standpoint, because of rounding errors. As an example, addition is associative: $ a + (b + c) = (a + b) + c $. However, depending on how you perform addition on floats, you might get different results, as we can show with a simple example:

open System

let rng = Random 0

let data = Array.init 16 (fun _ -> rng.NextDouble())

// directly summing all the numbers together:
let sum = data |> Array.sum
// summing the numbers by groups of 4,
// then summing the groups together:
let sumByBlocks =
    data
    |> Array.chunkBySize 4
    |> Array.map (fun chunk -> chunk |> Array.sum)
    |> Array.sum

printfn $"Equal: {sum = sumByBlocks} ({sum}, {sumByBlocks})"
Equal: False (9.108039789417779, 9.108039789417777)

Granted, this is a small difference, at the 15th decimal. However, it only took 16 values and creating groups of 4 to find a discrepancy, which is exactly what using SIMD vectors would do. In other words, even for simple operations like addition or multiplication, we cannot guarantee that a SIMD version of the operation would produce the same result as the array version. Whether this matters depends on context - but at the same time, errors do accumulate.

Second, the SIMD version of distance above is incomplete. It will work great, as long as the sizes of the 2 arrays are perfect multiples of 4. If they are not, the result will be silently incorrect: the tail end of the array does not form a complete block of 4 values, so MemoryMarshal.Cast leaves it out of the converted span, and it is never processed.

We can address this issue by making sure that we only process complete blocks of size 4 using SIMD vectors, and handle the remaining tail end as a regular array. As an example, for arrays of size 10, we would do something like this:

input array: [ 0; 1; 2; 3; 4; 5; 6; 7; 8; 9 ]
2 SIMD vectors using SIMD code: [ 0; 1; 2; 3 ], [ 4; 5; 6; 7 ]
1 remainder array using naive code: [ 8; 9 ]
aggregate the results

This is perfectly feasible, but the resulting distance function is starting to look a bit messy:

let full (v1: float[], v2: float[]) =

    // how many elements do not fit in a complete block,
    // and how many complete blocks can we process?
    let remainder = v1.Length % Vector<float>.Count
    let fullBlocks = v1.Length / Vector<float>.Count

    let s1 = MemoryMarshal.Cast<float, Vector<float>>(ReadOnlySpan(v1))
    let s2 = MemoryMarshal.Cast<float, Vector<float>>(ReadOnlySpan(v2))

    // SIMD code over the complete blocks ...
    let mutable total = 0.0
    for i in 0 .. (fullBlocks - 1) do
        let diff = s1.[i] - s2.[i]
        total <- total + Vector.Dot(diff, diff)

    // ... and naive code over the tail end
    if remainder > 0
    then
        for i in (v1.Length - remainder) .. (v1.Length - 1) do
            total <- total + (pown (v1.[i] - v2.[i]) 2)

    sqrt total

Good news: the performance is still excellent. Bad news: the code has gotten more complex, and performs the same operation using 2 different approaches, completely obscuring the underlying algebra. It is arguably worth it, given the speedup, but it is far from my original hope of code that looks like algebra.

Going beyond distance calculations

In spite of these minor issues, at that point I was still quite interested in exploring further. While I like code to look pretty, for that type of speedup, I will happily accept compromises!

However, as I attempted to explore other functions I use regularly, I wasn’t quite as successful as with my first example, and even ran into some issues I do not fully understand.

Note: the following examples all ignore the problem of vectors that are not clean multiples of Vector<float>.Count. My goal here was to get a sense for how much performance I might gain at best by using SIMD.

The first case was performing a translation, taking a step from a vector towards another vector. My starting point, the naive F# version, looks like this:

let naive (origin: float[], target: float[], coeff: float) =
    let dim = origin.Length
    Array.init dim (fun col ->
        origin.[col] + coeff * (target.[col] - origin.[col])
        )

The motivation for this one was my Nelder-Mead solver, Quipu, which relies heavily on that operation. The idea here is to take a starting point, origin, and take a step of a certain size, coeff, towards a target, to evaluate whether moving in that direction gives us a better objective function value.
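
As a quick illustration with made-up values, taking a half-step (a coeff of 0.5) from the origin towards the target:

// half-way between [0; 0] and [10; 10]
naive ([| 0.0; 0.0 |], [| 10.0; 10.0 |], 0.5)
// expected: [| 5.0; 5.0 |]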

I attempted multiple SIMD versions, the best one I got was this one:

let take1 (origin: float[], target: float[], coeff: float) =

    let origins = MemoryMarshal.Cast<float, Vector<float>>(ReadOnlySpan(origin))
    let targets = MemoryMarshal.Cast<float, Vector<float>>(ReadOnlySpan(target))

    let result = Array.zeroCreate<float> origin.Length
    let blockSize = Vector<float>.Count

    for i in 0 .. (origins.Length - 1) do
        let o = origins.[i]
        let t = targets.[i]
        // origin + coeff * (target - origin), one block at a time
        let movedTo = o - Vector.Multiply(coeff, o - t)
        movedTo.CopyTo(result, i * blockSize)
    result

| Method  | Mean      | Error    | StdDev   | Ratio | RatioSD |
|-------- |----------:|---------:|---------:|------:|--------:|
| classic |  85.90 ns | 0.430 ns | 0.359 ns |  1.00 |    0.01 |
| simdV1  |  43.66 ns | 0.914 ns | 1.783 ns |  0.51 |    0.02 |

Still a decent improvement over the naive version, but nothing close to the massive speedup we saw on the distance calculation. My intuition is that there are 2 relevant differences here:

- the distance folds everything down to a single scalar, so the intermediate results can stay in SIMD registers, whereas the translation has to copy every computed block back into a result array;
- the translation does very little arithmetic per element, so its cost is presumably dominated by moving data in and out of memory rather than by computation.

As a side note, I would be super interested to hear if there is a better way to copy back data to an array than what I did!
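
The best alternative I can think of - a sketch with a hypothetical name, take1b, which I have not benchmarked - would be to cast the result array itself to a span of SIMD vectors, and assign each block directly instead of calling CopyTo:

let take1b (origin: float[], target: float[], coeff: float) =

    let origins = MemoryMarshal.Cast<float, Vector<float>>(ReadOnlySpan(origin))
    let targets = MemoryMarshal.Cast<float, Vector<float>>(ReadOnlySpan(target))

    let result = Array.zeroCreate<float> origin.Length
    // view the result array itself as a span of blocks, and write to it directly
    let results = MemoryMarshal.Cast<float, Vector<float>>(Span(result))

    for i in 0 .. (origins.Length - 1) do
        results.[i] <- origins.[i] - Vector.Multiply(coeff, origins.[i] - targets.[i])
    result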

A 50% improvement in speed is nothing to sneeze at. However, this hints at different gains to be expected, depending on whether the operation returns a full vector, or a scalar - a single value aggregated / folded over the input vector.

Another operation I looked into was computing the average of an array of values. The code is very similar to what we landed on for the distance:

let naive (v: float[]) =
    v
    |> Array.average

let average (v: float[]) =

    let vectors = MemoryMarshal.Cast<float, Vector<float>>(ReadOnlySpan v)
    let mutable total = 0.0
    for i in 0 .. (vectors.Length - 1) do
        let v = vectors.[i]
        total <- total + Vector.Sum(v)
    total / float v.Length
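
As a quick check that the 2 versions agree, on a hypothetical array whose length is a clean multiple of 4:

let values = Array.init 16 (fun i -> float i)
naive values   // 7.5
average values // expected to match: 7.5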

The performance is exactly what we would expect from SIMD:

| Method  | Mean      | Error     | StdDev    | Ratio |
|-------- |----------:|----------:|----------:|------:|
| classic | 28.843 us | 0.3332 us | 0.3117 us |  1.00 |
| simd    |  7.207 us | 0.0267 us | 0.0223 us |  0.25 |

This is an aggregate over the data, which involves computing a sum. Instead of adding elements one by one, we sum them 4 by 4, and as a result the computation takes only 25% of the original time.

However, just as I thought I had a decent understanding of what to expect from aggregates, I tried to implement another operation I use a lot, the Log-Likelihood. It is largely similar to the distance and average, in that it computes an aggregate over the entire array, and so I was expecting a comparable performance improvement:

let naive (v: float[]) =
    v
    |> Array.sumBy (fun x -> log x)
    |> fun total -> - total

let take1 (v: float[]) =

    let s = MemoryMarshal.Cast<float, Vector<float>>(ReadOnlySpan(v))

    let mutable total = 0.0
    for i in 0 .. (s.Length - 1) do
        let v = s.[i]
        total <- total - Vector.Sum(Vector.Log(v))
    total

As it turns out, my expectations were totally wrong:

| Method  | Mean      | Error     | StdDev    | Ratio | RatioSD |
|-------- |----------:|----------:|----------:|------:|--------:|
| classic |  2.280 us | 0.0443 us | 0.0510 us |  1.00 |    0.03 |
| simdV1  | 25.672 us | 0.4196 us | 0.3720 us | 11.27 |    0.29 |

The SIMD version is 11 times slower than the naive version!?

I do not understand what is going on here. The likely culprit is Vector.Log. In order to use it, I had to upgrade to the dotnet 9 preview, because dotnet 8.0 did not offer it. Perhaps the implementation in preview is not quite there yet, or perhaps the instruction is not supported on my CPU. Beyond the performance issue, what I don’t like is that I have no idea where to start to understand the root cause, or what tools could be helpful here.
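
One possible starting point - a suggestion, not something I have tried here - would be BenchmarkDotNet's DisassemblyDiagnoser, which dumps the assembly code the JIT generates for each benchmark. That should at least show whether Vector.Log compiles down to actual SIMD instructions, or falls back to something slower. Assuming the 2 functions above are wrapped in a benchmark class, it would look along these lines:

open BenchmarkDotNet.Attributes

[<DisassemblyDiagnoser>]
type LogLikelihoodBenchmarks () =

    // avoid log 0 by starting the values at 1
    let data = Array.init 4096 (fun i -> float (i + 1))

    [<Benchmark(Baseline = true)>]
    member _.Classic () = naive data

    [<Benchmark>]
    member _.Simd () = take1 data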

The final function I attempted was a linear combination over a collection of arrays. A good example of where you might use it would be computing the middle position of a collection of points (their barycenter). If you had n such points, represented by vectors $ V_1, V_2, \dots, V_n $, their average position would be $ \frac {1} {n} \times (V_1 + V_2 + \dots + V_n) $.

More generally, a linear combination of n vectors would be something like $ c_1 \times V_1 + c_2 \times V_2 + \dots + c_n \times V_n $, where $c_i$ is a scalar and $V_i$ is a vector.

The naive F# version would look something like this, iterating by columns:

let naive (vectors: (float * float[]) []) =
    let dim = (vectors.[0] |> snd).Length
    Array.init dim (fun col ->
        vectors
        |> Seq.sumBy (fun (coeff, v) -> coeff * v.[col])
        )

However, I could not manage to get something similar working using SIMD vectors. This is likely due to my lack of familiarity with Spans, but I got stuck on trying to create a collection I could work with in a similar fashion. Any hints on how to get that working would be super welcome!
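
For what it is worth, here is one direction that might work - a rough sketch, not benchmarked, and assuming as before that lengths are clean multiples of Vector<float>.Count. Span and ReadOnlySpan are byref-like types, so they cannot be stored in an array or captured in a closure; instead, we can visit the input arrays one at a time, re-casting each one as we go, and accumulate into the result array through a span:

let linearCombination (vectors: (float * float[]) []) =

    let dim = (vectors.[0] |> snd).Length
    let result = Array.zeroCreate<float> dim
    // view the result array as a span of blocks we accumulate into
    let acc = MemoryMarshal.Cast<float, Vector<float>>(Span(result))

    for j in 0 .. (vectors.Length - 1) do
        let coeff, v = vectors.[j]
        // spans are cheap to create: we re-cast each input array as we visit it
        let s = MemoryMarshal.Cast<float, Vector<float>>(ReadOnlySpan(v))
        for i in 0 .. (s.Length - 1) do
            acc.[i] <- acc.[i] + Vector.Multiply(coeff, s.[i])
    result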

So what? Parting words.

So where does this leave me?

First, I was lucky picking up the distance calculation as a starting point. This showed me that SIMD has the potential for very large calculation speedups. Had I started with another example, with less impressive results, I would probably not have spent that much time digging :)

My main take at the moment is that Vector<float> is quite different from what I expected a vector to be. Specifically, its fixed-size nature makes it tricky to compose efficiently. We can write specialized functions to perform one task efficiently, but what I would like is to work with composable primitives. Consider the earlier translation example:

let take1 (origin: float[], target: float[], coeff: float) =

    let origins = MemoryMarshal.Cast<float, Vector<float>>(ReadOnlySpan(origin))
    let targets = MemoryMarshal.Cast<float, Vector<float>>(ReadOnlySpan(target))

    let result = Array.zeroCreate<float> origin.Length
    let blockSize = Vector<float>.Count

    for i in 0 .. (origins.Length - 1) do
        let o = origins.[i]
        let t = targets.[i]
        // origin + coeff * (target - origin), one block at a time
        let movedTo = o - Vector.Multiply(coeff, o - t)
        movedTo.CopyTo(result, i * blockSize)
    result

Fundamentally, what we are doing here is $ origin - coeff \times (origin - target) $. This requires 2 operations: scalar x vector multiplication, and vector subtraction. We could achieve that by writing code along these lines:

let sub (v1: float [], v2: float []) =

    let result = Array.zeroCreate<float> v1.Length
    let blockSize = Vector<float>.Count

    let s1 = MemoryMarshal.Cast<float, Vector<float>>(ReadOnlySpan(v1))
    let s2 = MemoryMarshal.Cast<float, Vector<float>>(ReadOnlySpan(v2))

    for i in 0 .. (s1.Length - 1) do
        let diff = s1.[i] - s2.[i]
        diff.CopyTo(result, i * blockSize)
    result

let mult (scalar: float, v: float []) =

    let result = Array.zeroCreate<float> v.Length
    let blockSize = Vector<float>.Count

    let s = MemoryMarshal.Cast<float, Vector<float>>(ReadOnlySpan(v))

    for i in 0 .. (s.Length - 1) do
        let multiplied = scalar * s.[i]
        multiplied.CopyTo(result, i * blockSize)
    result

let take4 (origin: float[], target: float[], coeff: float) =

    sub (origin, mult (coeff, sub (origin, target)))

This works, but the performance is, unsurprisingly, not good - a bit worse than the naive F# version:

| Method  | Mean      | Error    | StdDev   | Ratio | RatioSD |
|-------- |----------:|---------:|---------:|------:|--------:|
| classic |  78.55 ns | 1.613 ns | 1.920 ns |  1.00 |    0.03 |
| simdV1  |  43.34 ns | 0.921 ns | 1.164 ns |  0.55 |    0.02 |
| simdV4  |  84.86 ns | 1.202 ns | 1.065 ns |  1.08 |    0.03 |

The problem here is that each time we perform an operation (sub, mult), we incur a cost: each call allocates a new result array, and converts from arrays to SIMD vectors and back. What we would need here is a mechanism to keep chaining SIMD Vector operations together when appropriate, without getting in and out of the SIMD world in between.
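
One possible shape for this - a rough sketch with hypothetical names, and no benchmark to back it up - would be to express the operations at the Vector<float> block level, compose them there, and pay the conversion cost only once, at the edges:

// apply a block-level binary operation f across 2 arrays, in a single pass
let apply2 (f: Vector<float> -> Vector<float> -> Vector<float>) (v1: float[]) (v2: float[]) =

    let s1 = MemoryMarshal.Cast<float, Vector<float>>(ReadOnlySpan(v1))
    let s2 = MemoryMarshal.Cast<float, Vector<float>>(ReadOnlySpan(v2))

    let result = Array.zeroCreate<float> v1.Length
    let rs = MemoryMarshal.Cast<float, Vector<float>>(Span(result))

    for i in 0 .. (s1.Length - 1) do
        rs.[i] <- f s1.[i] s2.[i]
    result

// the translation becomes a single pass over the data: subtraction and
// scalar multiplication are composed at the Vector<float> level, in the lambda
let take5 (origin: float[], target: float[], coeff: float) =
    apply2 (fun o t -> o - Vector.Multiply(coeff, o - t)) origin target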

I suspect this can be achieved, and I might give that a try in a future post. In the meantime, I will stop here for today, and use SIMD only for hand-written, specialized operations!

You can find the code for this post here on GitHub.

Do you have a comment or a question?
Ping me on Mastodon!