Performance experiments with matrix multiplication
One of Rust’s design goals is to be fast. That actually requires two distinct things from the language. First, it shouldn’t introduce too much overhead (preferably none) for its abstractions and should be fast out of the box. Many people coming from high-level languages (Python, JavaScript, …) find this to be the case ‒ just type the program, compile it (with `--release`) and it’s reasonably fast. The other, no less important, is allowing the programmer to tweak some knobs when trying to squeeze a bit more speed out of the program.
I’ve decided to test the second a bit and see how far I could go. I’ve chosen matrix multiplication as a case study, for several reasons: I’ve played with it before (in my master’s thesis), it’s relatively simple, and the effects of optimizing it can be great. For simplicity, I’ve decided to multiply only square matrices with power-of-two sizes, but these restrictions can be lifted in a real implementation without significantly losing performance ‒ the code only gets somewhat more complex and hairy.
Some highlights of my findings, before going through details:
- Indeed, Rust can get pretty fast and allows playing with the knobs. I’ve compared my experiments with the Armadillo C++ linear algebra library and reached better results. The library was chosen mostly because it had a nice interface, but it’s reasonable to assume a production-quality library is optimized to some level.
- The tools I usually use for profiling C or C++ programs and the general techniques are applicable to Rust, without any changes.
- The final, tuned implementation is three orders of magnitude faster than the obvious implementation (the exact numbers differ from machine to machine). While it is certainly more complex than three nested for loops, the code is still readable and sane. A casual reader should still be able to see what happens, if not necessarily know why it turns out to be faster (and see right away that it correctly computes a matrix product).
- Deciding what to optimize and how to change it is about as hard as in C or C++, but the great libraries in Rust’s ecosystem make it much less work to implement the actual changes.
Step by step
I started with the trivial implementation even though I knew it is not optimal (and even though I already had a pretty good idea what changes I’d need later on, from my previous experiments in C++). This was for two reasons. I wanted to have a comparison baseline. More importantly, it serves as a cross-check: a slow but trivial implementation can be used to check the faster but more complex (and therefore more error-prone) program. If you ever try to optimize something, save the original to cross-check against, both for performance and for correctness.
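For reference, the trivial version is essentially this (a condensed sketch; the `Matrix` type and its row-major indexing are simplified here, not copied from the repository):

```rust
/// A minimal square, row-major matrix (sketch, not the repository's actual type).
struct Matrix {
    size: usize,
    data: Vec<f32>, // element (row, col) lives at data[row * size + col]
}

/// The obvious O(n³) triple loop ‒ the baseline and the correctness cross-check.
fn multiply_trivial(a: &Matrix, b: &Matrix) -> Matrix {
    assert_eq!(a.size, b.size);
    let n = a.size;
    let mut result = Matrix { size: n, data: vec![0.0; n * n] };
    for row in 0..n {
        for col in 0..n {
            let mut sum = 0.0;
            for k in 0..n {
                // `a` is walked by rows, `b` by columns ‒ the cache-unfriendly part.
                sum += a.data[row * n + k] * b.data[k * n + col];
            }
            result.data[row * n + col] = sum;
        }
    }
    result
}
```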
A few runs of perf confirmed what I expected. Perf is a potent non-intrusive profiler (it doesn’t need to inject its own code into the program and doesn’t slow it down). It runs on Linux systems and relies on hardware support in modern-ish CPUs, but it can give very detailed information both about where the program spends its running time (down to instruction resolution, in contrast with other profilers that go only to function resolution) and why it is slow there ‒ information like how often a specific branch instruction is mis-predicted or whether it fights over a memory location with another CPU. Even though it is somewhat harder to learn, if you need to dig into the details and are serious about performance tuning, it is an invaluable tool. Also, FlameGraph can show a more visual representation of the measurements. I still miss some capabilities that should be possible, but this is the best thing I could find so far.
Memory layout
The problem was memory accesses outside of the cache. The trouble with matrix multiplication is that one matrix is traversed by rows while the other by columns. No matter whether you store your matrix in row-major or column-major order (that is, whether rows or columns are contiguous in memory), one of the two will suit the CPU caches well enough, but the other will just kill performance. Even if a whole row doesn’t fit into the cache, traversing it will at least use all the bytes from each cache line (caches don’t work on single bytes, but on contiguous aligned blocks called lines ‒ when accessing even a single byte of a line, the whole line is fetched from RAM). One memory load is amortized across several cells. The other direction, however, needs to load a new cache line for each cell. And loading memory from RAM is slow compared to what the cache can supply. Furthermore, even if we use only a small part of a line, it still takes a full line’s worth of space in our cache, possibly kicking something else out.
The solution to this problem is to store the matrix in a Z-order representation. We split the matrix into quadrants, each of them represented as a contiguous interval of the array. Each quadrant is represented in the same way, recursively. If we consider the big matrix as a 2×2 matrix of quadrants, we transform the big multiplication into 8 smaller multiplications (and some matrix additions, but these are faster, both because they traverse the matrices in tandem, so we don’t need to jump around memory, and because they have smaller asymptotic complexity). The smaller multiplications are performed recursively. This is a winning strategy because below a certain size (which depends on the size of the cache), the whole input and output fit into the cache, avoiding the memory loads ‒ each memory location is loaded many fewer times: once for the multiplication and several times for additions on the bigger matrices. This also plays well with the cache hierarchy ‒ while a cell of the big matrix had to be loaded directly from RAM in the natural order (because it wouldn’t fit into any of the CPU caches), here it is likely loaded from the next bigger cache (if our matrix fits into L1, the bigger one probably fits into L2 and so on).
This gives us a speedup of roughly 2×, even when we include the conversion from the „obvious“ memory layout to the Z-layout and back (at least on my CPU, your mileage may vary).
The problem here is that we traded the better cache locality for a more complex algorithm that does more work overall (not asymptotically, but the stack manipulation for recursion is simply more work than just incrementing a counter in a for loop). We win on big inputs, but lose terribly on small ones. Let’s use a hybrid approach ‒ stop the recursion (both of the multiplication and of the layout) at a certain size and switch to the trivial implementation for the tiny matrices. The best size on my CPU seems to be 8 (I tried only powers of two), which provides about an 11× speedup against the baseline. We still do the same algebraic operations on the matrix elements, we just reorder them.
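In condensed form, the recursive part with the cut-off looks roughly like this. It is a sketch under simplifying assumptions, not the code from the repository ‒ here the matrices are plain `f32` slices in Z-order, and the trivial kernel is a slice-based variant of the triple loop above:

```rust
/// Threshold below which recursion stops; 8 was the sweet spot on my CPU.
const CUTOFF: usize = 8;

/// Plain triple loop over row-major `size`×`size` blocks, accumulating into `result`.
fn multiply_small(a: &[f32], b: &[f32], result: &mut [f32], size: usize) {
    for row in 0..size {
        for col in 0..size {
            let mut sum = 0.0;
            for k in 0..size {
                sum += a[row * size + k] * b[k * size + col];
            }
            result[row * size + col] += sum;
        }
    }
}

/// One of the four contiguous quadrants of a matrix stored in Z-order.
fn quadrant(m: &[f32], i: usize) -> &[f32] {
    let quarter = m.len() / 4;
    &m[i * quarter..(i + 1) * quarter]
}

/// Recursive multiplication over the Z-order layout, switching to the trivial
/// kernel below CUTOFF. `result` is accumulated into (assumed zeroed by the caller).
fn multiply_recursive(a: &[f32], b: &[f32], result: &mut [f32], size: usize) {
    if size <= CUTOFF {
        // Below the cut-off the layout recursion stops too, so the tiny block
        // is stored row-major and the triple loop applies directly.
        multiply_small(a, b, result, size);
        return;
    }
    let half = size / 2;
    let quarter = result.len() / 4;
    for i in 0..2 {
        for j in 0..2 {
            // C_ij = A_i0·B_0j + A_i1·B_1j, computed on the quadrants.
            let target = &mut result[(2 * i + j) * quarter..(2 * i + j + 1) * quarter];
            multiply_recursive(quadrant(a, 2 * i), quadrant(b, j), target, half);
            multiply_recursive(quadrant(a, 2 * i + 1), quadrant(b, 2 + j), target, half);
        }
    }
}
```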
Threads
Computing each quadrant of the result is independent of the other quadrants. And whenever we have multiple big independent chunks of work, it is worth trying to distribute the work between the CPU cores. Processors don’t get much faster, but they get more and more parallel, primarily by means of having more cores. A core is conceptually close to a complete, small processor ‒ there are just multiple of them bundled in one silicon package.
While simple in principle, there are a few small details of note.
- On each level of recursion, every matrix is decomposed into 4 more. Starting a new thread for each won’t work (there’d be too many of them after a few levels), so we either need to stop at a certain level (but the number of available cores might not be a power of 4), or use a thread pool. A good thread pool is provided by the Rayon crate. This is a real help, as it balances the load better than just guessing the level of recursion where we stop, and it has quite a simple interface ‒ I simply created a 4-element array for the sub-tasks and let the parallel iterators API do its job (see the sketch after this list).
- Sharing tasks between threads has even more overhead than recursion. Therefore, we need to use the hybrid approach here as well ‒ stop distributing the work at a certain level of recursion (possibly sooner than stopping the recursion itself). If we didn’t, it would actually hurt performance (yes, I measured that).
- We can’t expect a full N× speedup with N available cores. Even if the synchronization cost were zero, the cores aren’t really fully independent processors. They share the bandwidth to RAM, share some of their caches, maybe even some execution units. It is also quite common to limit the frequency depending on how many sibling cores run at the time, to limit power consumption and heating. This is less of an issue on machines with multiple real silicon packages, but there are other factors limiting the maximal overall throughput.
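The distribution itself then needs only a few lines of Rayon. A hedged sketch of the idea, reusing the `quadrant` and `multiply_recursive` helpers from the previous sketch (the real code creates a 4-element array of sub-tasks; here the four disjoint result quadrants are handed to the pool directly, and the `PARALLEL_CUTOFF` constant is illustrative):

```rust
use rayon::prelude::*;

/// Below this size, distributing the work costs more than it saves, so we
/// fall back to the single-threaded recursion (hypothetical value, needs tuning).
const PARALLEL_CUTOFF: usize = 512;

/// Like `multiply_recursive`, but computes the four result quadrants in
/// parallel on Rayon's work-stealing thread pool.
fn multiply_parallel(a: &[f32], b: &[f32], result: &mut [f32], size: usize) {
    if size <= PARALLEL_CUTOFF {
        multiply_recursive(a, b, result, size);
        return;
    }
    let half = size / 2;
    let quarter = result.len() / 4;
    // The result quadrants are disjoint slices, so they can be filled in
    // independently; the pool balances the (recursively spawned) sub-tasks.
    result
        .par_chunks_mut(quarter)
        .enumerate()
        .for_each(|(idx, target)| {
            let (i, j) = (idx / 2, idx % 2);
            // C_ij = A_i0·B_0j + A_i1·B_1j, possibly parallelizing further.
            multiply_parallel(quadrant(a, 2 * i), quadrant(b, j), target, half);
            multiply_parallel(quadrant(a, 2 * i + 1), quadrant(b, 2 + j), target, half);
        });
}
```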
Anyway, using all 8 cores I gained nearly a 7× speedup against the single-threaded version and 75× against the baseline. A better speedup can be expected (and was measured) on a machine with more cores, obviously.
SIMD
When it comes to raw computation power, cores aren’t the only thing a modern CPU has to offer. There are also vector instructions. An ordinary (scalar) instruction takes a certain number of operands, performs the corresponding operation and produces the result (for example, an addition takes two operands). A vector instruction, on the other hand, takes N sets of operands and produces N results, just like running N copies of the same scalar instruction would ‒ but it’s a single instruction, so it executes much faster (though likely a bit slower than a single scalar one). The N depends on the kind of vector instructions available on the given CPU ‒ the state of the art is 512-bit vectors, which means, for example, sixteen 32-bit floats in a single vector. However, all N sets need to be independent.
There are several downsides to vector instructions. First, not all CPUs support all kinds of vector instructions. To get optimal results, one needs to compile the binary for a specific processor (or bundle multiple versions of the same function inside the binary and choose one at runtime). Second, they are traditionally hard to use. Sometimes the compiler tries to use them on its own, but proving that it may do so is often hard and the compiler gives up. Using compiler intrinsics ‒ basically the specific vector instructions, just syntactically called as functions ‒ is tedious and unportable (if you hand-code the algorithm for SSE instructions, you have to start over when you want to switch to AVX).
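To illustrate the tedium, this is roughly what a single multiply-accumulate over eight floats looks like with raw intrinsics (the wrapper function is my own sketch; only the `std::arch` intrinsics themselves come from the standard library):

```rust
// Hand-written, AVX-only multiply-accumulate over 8 floats at a time.
// It is tied to x86_64 with AVX; an SSE, AVX-512 or ARM NEON version
// would have to be written separately from scratch.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn mul_add_avx(acc: &mut [f32; 8], a: &[f32; 8], b: &[f32; 8]) {
    use std::arch::x86_64::*;
    let va = _mm256_loadu_ps(a.as_ptr());
    let vb = _mm256_loadu_ps(b.as_ptr());
    let vacc = _mm256_loadu_ps(acc.as_ptr());
    let result = _mm256_add_ps(vacc, _mm256_mul_ps(va, vb));
    _mm256_storeu_ps(acc.as_mut_ptr(), result);
}
```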
There are a few handy libraries in Rust that help a lot with this (all of them currently unstable, but stabilization is expected any time now). The first of them, stdsimd, provides vector types like `f32x16`. These represent a tuple of that many primitive values, but the operations on them use the best instructions available ‒ if 512-bit SIMD is available, an addition or a multiplication is a single instruction. If only 256-bit vectors are available, it compiles down to a pair of these basic instructions.
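As a tiny, hedged illustration of the idea (the exact module path and constructors of stdsimd moved around several times before stabilization, so treat them as assumptions):

```rust
// Illustrative only: the pre-stabilization stdsimd API changed repeatedly,
// so the import path and constructor used here are assumptions.
use stdsimd::simd::f32x16;

/// Sixteen multiply-adds expressed as ordinary arithmetic on one vector type.
/// The compiler lowers it to the widest instructions the target CPU offers.
fn scale_and_add(factor: f32, x: f32x16, y: f32x16) -> f32x16 {
    f32x16::splat(factor) * x + y
}
```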
The other, higher-level library is faster. It allows iterating through a large array of primitive types by splitting it into optimally sized vectors and calling a closure on each. You don’t have to care how large your CPU’s vectors are; the library chooses at compile time and provides the appropriate type to the closure (the actual type exposed, like `f32s`, is an alias for one of the `f32x*` types, with the size chosen to match the best thing the hardware supports).
Both have the advantage that they abstract over the exact hardware capabilities ‒ they can run without any hardware support (they just don’t provide any speedup then) and can add support for future SIMD instructions once they become available, without needing any change in the application ‒ bumping the dependency version and rebuilding should be enough. And the code still looks like Rust, not like assembler.
I used the latter. There’s one trick needed ‒ we have to iterate the right matrix by columns, which (similarly to cache lines) doesn’t go well with vector processing. The library allows „striding“ ‒ skipping N elements between each two used ones ‒ but even when backed by hardware support, this is slow. So I copy a column out into a temporary buffer before processing it. To amortize the cost of copying, I multiply the column with all relevant rows before moving to the next one (i.e. the outer loop is over columns).
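With the column copied out, the hot loop is a plain dot product of two contiguous slices. A hedged sketch of what that looks like with faster’s iterators ‒ the method names follow the 0.5-era API as I remember it, so treat them as assumptions rather than a verbatim quote of the library:

```rust
// Hedged sketch of the per-cell dot product using `faster` iterators.
// The method names follow the crate's 0.5-era API and may have changed since.
use faster::*;

fn dot(row: &[f32], column: &[f32]) -> f32 {
    (row.simd_iter(f32s(0.0)), column.simd_iter(f32s(0.0)))
        .zip()
        // Multiply and accumulate a whole vector of lanes per step…
        .simd_reduce(f32s(0.0), |acc, (a, b)| acc + a * b)
        // …then fold the lanes of the accumulator into a single scalar.
        .sum()
}
```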
If used alone, vector processing gives a 66× speedup against the baseline (without using threads). If you wonder how that is possible (the theoretical maximum would be 16×, as we have 16 lanes in the vector), note that the copy-column trick alone gives 10× (and the vectors give a bit over 6× on top of that). Combining it with the above recursive approach is a little tricky ‒ getting the vector machinery up to speed has some overhead (the library needs to handle the ends of the array, which don’t necessarily form whole vectors). Therefore, the optimal cut-off for the hybrid approach moved to larger small-matrix sizes, from 8 to 256. However, the combination of all the above improvements gives a speedup of 500× over the baseline. We still do the same set of basic algebraic operations ‒ we just harness the capabilities of the hardware to do them much faster.
Strassen’s algorithm
This is the final improvement I’ve made. It isn’t directly applicable to anything else but matrix multiplication. In general, though, using an algorithm with a better time complexity should be the first thing to try (I left it for last because it is more complex and I wanted to incorporate the above improvements into it, instead of writing it multiple times with and without threads and SIMD ‒ but that’s usually not the right order).
By some clever transformations of the operations, it is possible to replace our 8 smaller multiplications with just 7 (at the cost of some more complex additions and subtractions). This lowers the time complexity a bit (as the multiplication is the expensive part) and can directly reuse all our previous improvements. Also, the additions and subtractions can be done with faster, as we can consider the whole matrix a single long array. The downside is, I didn’t manage to do it in-place, so the algorithm needs significantly more memory. I think it could at least be improved somewhat (and I’ll probably do that later on, so I can get measurements from even larger matrices).
The exact formulae for the transformation, as well as the whole theory behind it, are on Wikipedia.
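For reference, these are the standard formulae, splitting each matrix into quadrants (the numbering matches the usual presentation):

$$
\begin{aligned}
M_1 &= (A_{11} + A_{22})(B_{11} + B_{22}) \\
M_2 &= (A_{21} + A_{22})\,B_{11} \\
M_3 &= A_{11}\,(B_{12} - B_{22}) \\
M_4 &= A_{22}\,(B_{21} - B_{11}) \\
M_5 &= (A_{11} + A_{12})\,B_{22} \\
M_6 &= (A_{21} - A_{11})(B_{11} + B_{12}) \\
M_7 &= (A_{12} - A_{22})(B_{21} + B_{22}) \\
C_{11} &= M_1 + M_4 - M_5 + M_7 \\
C_{12} &= M_3 + M_5 \\
C_{21} &= M_2 + M_4 \\
C_{22} &= M_1 - M_2 + M_3 + M_6
\end{aligned}
$$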
The measured speedup on a matrix with a side of 4096 elements is about 730×. It is expected to be even more on larger matrices (due to the lower time complexity), but I haven’t measured that yet ‒ 4096 was the biggest matrix I ran on the baseline, which took long enough (almost an hour). The expected baseline time for an 8192-sided matrix would be eight times as much, while the Strassen algorithm gets it done in under half a minute.
The source code
You can look through the code of the experiments at https://github.com/vorner/fastmatmult. It also contains the measured times from several different machines (ranging from a low-powered Celeron to a machine with two Xeon processors, each with 10 hyper-threaded cores ‒ 40 virtual cores in total).
Note two things, though. First, to cover all the algorithms, some of them are composed together with generics (e.g. generic over whether to distribute across cores or not) or by passing closures around. This makes the code somewhat more complex than an implementation of any single algorithm would be. Thanks to Rust’s monomorphization, the composition of algorithms should come with zero performance hit, but it still makes the code a little harder to read.
Second, as mentioned above, Rust’s SIMD support isn’t stable yet. For now, the code compiles with a specific version of nightly, but there have been many changes in the area lately ‒ so it won’t compile with the newest nightly. I’ll probably update it once things stabilize (which may or may not change the measured numbers a bit).
Also, the repository contains slides for a presentation about the experiment. I’ve given an internal company talk about it; as there’s no company-specific know-how in it, the talk might get published eventually.
The graphs
It wouldn’t be a proper benchmark if it didn’t have some nice graphs, would it? This one comes from my own machine; you can generate the others (or measure your own) from the source code.
Both axes are on the log scale.
Future improvements
There are some areas that could still be tried:
- Trying the Strassen algorithm with reuse of the temporary matrices and the final result, which would lower the memory consumption and could lower the time as well (if it uses less memory for the same input size, a larger input could fit into the cache).
- Store the matrix in explicit vector types (e.g. `f32x16`). This could get rid of the slow-down in faster, which is caused both by having to handle unaligned vector data (the fastest way to use vector instructions is to have the vectors aligned in memory to cache-line boundaries) and by having to handle the possibility of the array length not being a multiple of the vector size.
- If there ever is an opportunity to give the talk again somewhere else, the presentation could use some improvements 😇.