Using const generics in slipstream

Some time ago, I was experimenting with SIMD „by cheating“. The library (called slipstream) offers the vector types. These are little fixed sized arrays which correspond to the registers in the CPU.

Unlike the „real“ libraries (packed_simd), it doesn’t however force the SIMD by explicitly using the compiler intrinsics. The vector types are really only fixed sized arrays with the right methods and with forced alignment properties. The compiler has enough information to prove it can vectorize the code and oftentimes does so in a good enough way.

The advantage is there’s much less „magic“ inside the library code and it works on stable. The disadvantage, the auto-vectorizer doesn’t always do the best job and can even make the code somewhat slower than the original.

The recommended way is to combine the library with something that generates multiple versions of the functions and picks the right one by runtime CPU feature selection, like the multiversion crate.

The challenge

Since the size of the registers depends on the CPU feature level, it is not known in advance. It is possible to use larger ones (and the compiler will just emulate it by using multiple registers and operations). But as there are several register sizes and several basic types, each combination with its own alignment, there’s a large list of combinations that make sense.

The user is presented with convenient type aliases, like f32x4 ‒ a vector of 4 32 bit floats. But the library has to somehow offer all these types, even though they are mostly the same.

The original (0.1) version used the generic-array library to hold the actual data and types like these to force the alignment:

#[repr(align(64))]
Vector64<B, S>
where
    // Some uninteresting and ugly type bounds
{
    data: GenericArray<f32, U4>,
}

And then used transmutes, pointer casts and memory copies to operate these. While it got the job done, the code could only be described as hairy and ugly. Also, there was more unsafe around than felt necessary. It was easy enough to reason about, but it still didn’t feel right.

Welcome const generics

I guess the whole post is about this. The arrival of const generics arrival made it possible to simplify the code a lot. In combination with these, the alignment is not an attribute directly on that type, but is brought in through another marker type (a zero-sized one). It looks like this.

#[repr(C)] // So the alignment ZST is at the start
Vector<A, B, const S: usize> {
    // Forces alignment, but as it is ZST, it does not take any space.
    _align: [A; 0],
    data: [B; S],
}

The code is much shorter, easier to follow and reason about. Due to that, the new version brought some more features (nothing big) and the types dereference to the fixed sized arrays, not slices, which might be more convenient for some code.

There’s still some unsafe around, though. It is needed to handle initialization of the arrays, so it deals with MaybeUninit. There might be nicer ways to do it, but the library tries to use as primitive ways as possible so the auto-vectorizer has an easier way and works more often. This is one place where using old-style index loops instead of iterators is faster. I guess this is because the auto-vectorizer is written with C/C++ code in mind (it lives somewhere in LLVM) and the range checks are trivially removed, since they are to a fixed-sized constant (passed as the generic parameter).

Should you use the crate?

Well, that varies. It is mostly an experiment from my side. You can try it, it should not outright break anything. But always measure the results.

Sometimes, you can get quite good speedups. Sometimes, you get nothing, mostly because even your original code was already auto-vectorized (that thing is quite powerful). Sometimes, you may get a slow down, because the auto-vectorizer gets confused and it starts „juggling“ the registers (I’m not entirely sure what happens and why, but the resulting code is full of moving values between registers there and back).

Anyway, the API is quite close to what you get with packed_simd, with the exception of the vectorize family of functions. So if you rewrite it with slipstream and it is slow, you can try replacing it with that and see if it helps.

The advantage of slipstream over packed_simd is however being able to compile on stable. You can consider it a best-effort temporary solution before the explicit packed SIMD support matures.

Big thanks

Nothing of this would be possible without the work of… a lot of people, I guess. It’s not only the Rust work on const generics that took some time and effort. There’s also the magic in LLVM, years of accumulated smartness and research. And most of it is quite invisible. It wouldn’t be possible without the people that use it either.

So, thanks ☺, you’re great, whoever you are.