SIMD by cheating
Since the last post about SIMD library plans, I’ve been experimenting. Needless to say, it turned out a bit differently than originally planned, but I have something I’d like to share. Maybe it’ll be useful for someone, or maybe it’ll at least spark some more progress in the area.
If you don’t care about the chatter and just want to use it, it’s called `slipstream` and is available on crates.io. It’s an early release and will need some more work, but it can be experimented with (it has documentation and probably won’t eat any kittens when used). If you want to help out, scroll down to see what needs help (or decide on your own what part needs improving 😇).
Issues with the original plan
The original idea was to build types that would look like `u32x16<Avx2>`. These would have operators and methods implemented using the intrinsics from the `core::arch` module, so they would be fast. The user would write the performance-critical code as a function generic over the instruction set, and some kind of helper would choose the best instruction set and call the right instantiation of the function.
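To make that a bit more concrete, here is a rough sketch of the shape I had in mind ‒ illustrative only, none of these names are real slipstream API:

```rust
// A vector type tagged by the instruction set, so the performance-critical
// code can be written once, generic over that tag, and instantiated per
// instruction set by some dispatching helper.
#[allow(non_camel_case_types)]
struct u32x16<Set> {
    lanes: [u32; 16],
    _set: std::marker::PhantomData<Set>,
}

struct Avx2; // marker type standing for one instruction set

// The hot code would be generic over the instruction set; the operators on
// u32x16<Set> would be backed by the matching core::arch intrinsics.
fn sum<Set>(chunks: &[u32x16<Set>]) -> u32 {
    chunks.iter().map(|v| v.lanes.iter().sum::<u32>()).sum()
}
```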
However, this has several drawbacks:
- There are several base types (`u32`, `u64`, `f32`, …). There are several sizes of vectors one would like to have. And there are several instruction sets. There’s a lot of combinations of these and most of them need a different implementation ‒ different intrinsics for different base types, for different instruction sets… The naming conventions would allow one to generate all these implementations somehow (I’ve tried generating them by the `build.rs`), but this blows up the compilation time. At one point in time, the crate was building for about 10 minutes (while not covering all the instruction sets yet). I’ve managed to restructure it enough to get down to about a minute, but it still felt like too much.
- There’s a catch to using the intrinsics. One can use e.g. an `AVX2`-level intrinsic in whatever function and it’ll get inserted in there alright. But unless the function it gets inserted into is allowed to use the `AVX2` instruction set too (either by compile-time switches or by the `target_feature` attribute), the intrinsic can’t get inlined. That means the CPU will load, let’s say, 8 floats into an AVX2 register, then it’ll call a function containing a single AVX2 instruction to e.g. add the 8 floats to another register, and then it’ll return from that function. While the single instruction might be super fast, the overhead of calling a separate function for each such very fast instruction is huge and it just doesn’t work performance-wise. So while using the intrinsics (either directly or in the form of wrapping types), one still needs to put the attribute onto the function. That makes the approach of using traits/generics unusable for this (a sketch of the problem follows this list).
- The code that was produced by using a „generic“ instruction set (one that doesn’t have any explicit intrinsics and just emulates the vector types of a given size by doing a bunch of scalar operations) was in many cases faster than the code produced with explicit instructions, or at least comparable in its performance.
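A minimal sketch of the inlining problem from the second point ‒ the intrinsic and the attribute are real, the function name is made up:

```rust
#[cfg(target_arch = "x86_64")]
mod sketch {
    use core::arch::x86_64::{__m256i, _mm256_add_epi32};

    // This can be called from any function, but unless the caller is also
    // compiled with AVX2 enabled (by compile-time switches or its own
    // #[target_feature] attribute), this tiny function cannot be inlined into
    // it ‒ so every single vector addition pays the full call overhead.
    #[target_feature(enable = "avx2")]
    pub unsafe fn add_avx2(a: __m256i, b: __m256i) -> __m256i {
        _mm256_add_epi32(a, b)
    }
}
```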
Furthermore, I’ve looked into the source code of some of the intrinsics (and of the `packed_simd` types). They contain LLVM intrinsics such as `simd_add`.
So, wait a moment… I go to lengths to abstract away the differences between the
instruction sets to hide the fact I have separate intrinsics for each of them.
But then, the compiler just throws it away and replaces the separate
intrinsics with a vague „This is an addition of a such-and-such large vector, figure
out some SIMD in there“. Then it again looks at what instruction set it has
enabled and picks something as it sees best.
Shouldn’t there be a more direct way to say „You have these instructions and a vector, figure out some SIMD in there“? And it turns out there is.
The compiler has an optimizing part (is it called a pass?) called the auto-vectorizer. This thing tries its best to replace ordinary scalar instructions with vector ones where it makes sense. And it is pretty good at it. However, it first needs to prove that it can replace these 8 scalar additions with a single vector addition.
One can do it by having this `simd_add` intrinsic, which is not exposed to mere mortals’ crates. But in fact, it is enough if the compiler sees that there’ll be exactly 8 scalar additions (or however many scalar instructions that particular vector instruction corresponds to) and that the 8 inputs and 8 outputs will be well aligned in memory. It can figure out some more complex things too, like shuffles and some masked instructions. It probably can’t do all the weird instructions that are available (I still haven’t seen it produce a `gather`, for example, and I haven’t even tried this drunkenly swaying one).
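For illustration, this is the kind of loop it handles well ‒ a toy example, not taken from slipstream:

```rust
// Fixed trip count, contiguous inputs and outputs, nothing the optimizer
// can't see through: with the right target features enabled, LLVM happily
// turns this into a few vector additions instead of 8 scalar ones.
#[inline]
fn add8(a: [f32; 8], b: [f32; 8]) -> [f32; 8] {
    let mut out = [0.0f32; 8];
    for i in 0..8 {
        out[i] = a[i] + b[i];
    }
    out
}
```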
Slipstream cheats
So, instead of going through the very indirect way of intrinsics, I’ve decided to write the vector types in a way the auto-vectorizer would like them and leave the rest to it. In principle, a `u32x16` is „just“ a well-aligned `[u32; 16]` and its addition operator is „just“ a for loop over the elements of the two same-sized input arrays. There are some tedious, uninteresting details, like using the `generic-array` crate, annotating the operator methods with `#[inline]` and such, but the idea is the same. This simple thing makes the code much more tasty to the auto-vectorizer and it does the real magic.
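Roughly the shape of such a type ‒ a simplified sketch, not the actual slipstream internals:

```rust
use std::ops::Add;

// An over-aligned fixed-size array with a plain-loop Add impl. The alignment
// and the #[inline] hint are there purely to keep the auto-vectorizer happy.
#[derive(Clone, Copy)]
#[repr(align(64))]
struct U32x16([u32; 16]);

impl Add for U32x16 {
    type Output = Self;
    #[inline]
    fn add(self, rhs: Self) -> Self {
        let mut out = self.0;
        for i in 0..16 {
            out[i] = out[i].wrapping_add(rhs.0[i]);
        }
        U32x16(out)
    }
}
```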
That’s why the crate is called `slipstream` ‒ similar to the real thing, it doesn’t make anything go faster by applying any force to it, it just removes some friction so the engine can work better.
Runtime instruction set detection
You might be asking by now what happened to the idea of using the best instruction set the CPU supports. Well, since you need the `target_feature` annotations anyway, even when using the intrinsics, you can use the same annotations to get the better instruction sets into the auto-vectorized code as well.
And after publishing the previous post, I got contacted by Caleb Zulawski. It turns out he wrote a great crate to solve the awkwardness of using the `target_feature` thing, called `multiversion`. So `slipstream` doesn’t offer any direct support for dealing with instruction sets, but it works quite well together with `multiversion`. The documentation and the repository contain some examples of the use.
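If you want to see what that amounts to without the macro, this is roughly the pattern `multiversion` automates ‒ a hand-rolled sketch using std’s runtime feature detection (the function names are made up):

```rust
#[cfg(target_arch = "x86_64")]
fn sum(data: &[f32]) -> f32 {
    // Same body, compiled once per instruction set; the auto-vectorizer is
    // allowed to use AVX2 inside the annotated version.
    #[target_feature(enable = "avx2")]
    unsafe fn sum_avx2(data: &[f32]) -> f32 {
        data.iter().sum()
    }

    fn sum_fallback(data: &[f32]) -> f32 {
        data.iter().sum()
    }

    // Dispatch based on what the CPU actually supports at runtime.
    if is_x86_feature_detected!("avx2") {
        unsafe { sum_avx2(data) }
    } else {
        sum_fallback(data)
    }
}
```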
What if one still wants to be explicit about the vectorization
If you really wanted to be explicit, the fact that I’ve given up on that part might be a disappointment. I’d direct you to the `safe_simd` crate-in-development, also from Caleb Zulawski. Originally, we planned to somehow unify our efforts eventually, but I guess my change of mind about going the auto-vectorized way put a stop to that. I still think both experiments are worthwhile, so give his crate a glimpse too.
How good is it, really
I’ve tried to write some benchmarks (they are in the repository). Both from the results that are there and from the process of writing them, I reached the conclusion that „it varies“.
In most cases, I’ve found some way to write the code around the vectors that produces results of the same speed as manual vectorization by directly using the intrinsics, sometimes even faster. So on one side, it is possible to really improve the performance of the program by using the library.
On the other hand, a lot of versions were slower than the best one and sometimes slower than the „naïve“ scalar implementation. There were several reasons for the slowness ‒ sometimes things didn’t get inlined properly, „losing“ the permission to use better vector instructions. Sometimes, on some CPUs (I’m looking at you, AMD Bulldozer), using one AVX2 instruction is slower than using two SSE instructions. Sometimes the code got vectorized, but for whatever reason the compiler started to juggle between the registers ‒ the code was full of move instructions, and even though these were vector move instructions, they added overhead. Sometimes it got weirdly half-vectorized ‒ some operations were vectorized and some weren’t.
I’ve seen the same code, compiled by the same compiler, be significantly faster than the naïve one on one CPU and slower on another. Smells of arcana a bit.
I don’t know how stable the results are across different compiler versions.
Therefore, when using the library, some experimentation and benchmarking is needed.
However, I have an intuition that some of these problems would still be happening with manual use of intrinsics or with an explicit vectorization library. I also believe some of these problems can be improved ‒ e.g. by having better control over inlining. And maybe someone is able to twist the internals of the vector types a little to produce better results, or to produce them with better odds.
Help wanted
There’s some actual „work“ to do on the code itself ‒ it’s riddled with `TODO` comments and such. The API surface definitely isn’t complete. Some of these things are listed in the crate documentation.
But these are the things I’ll probably get to over time myself if nobody else does it first (I’ll still appreciate the help, though, and with more people it’d get done sooner). The most valuable thing you can help with is trying it out, reporting what works, what doesn’t ‒ in the form of issues or by submitting more benchmarks. As said above, I don’t really know if this approach is going to be usable in bigger programs and if it would be practically useful, or if it’s only good enough for summing up a big array of numbers.
In other words, I don’t know your use cases, but I’d like to support them if reasonably possible. Or at least find out that this approach is a dead end.
Actually, what I would really like is for `packed_simd` to become usable on stable Rust so this thing is no longer needed. I’ll be happy to deprecate the thing in favor of `packed_simd` or something like that eventually.