If you want performance, cheat!
Have you ever been banging your head against a performance problem (or a perceived performance problem, the "this really should be possible to make faster" kind), running the profiler, tweaking, shaving off milliseconds, only to eventually discover (or be advised by someone of) a way to completely sidestep it with some neat cheating trick?
Sometimes I feel like there are plenty of these in our industry, and each one is nice to know, or at least interesting in its own way. So I'm going to share two such cheats I have lying around in my code. You can share your favourites too.
Atomic Arcs and caching
I've written the arc-swap crate (and I'm slowly working towards a 1.0 release, but there's always something else to work on too, so it takes time). The idea of the library is that there's a place where an Arc can be stored. This place can be shared between multiple threads, and the threads can read the Arc or change it to point somewhere else. Somewhat like RwLock<Arc<T>>, but faster and without the locking.
This can be useful in situations where there's some global-ish data structure that is read very often but seldom updated. Think of configuration that can be updated by the user but is used all the time, or a routing table that is used to route each packet but only needs to change on occasion.
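For illustration, basic use looks something like this (a minimal sketch using arc-swap's load and store; details of the API surface differ between versions):

use std::sync::Arc;
use arc_swap::ArcSwap;

// A shared slot holding the current configuration.
let config = ArcSwap::from_pointee(String::from("listen = 0.0.0.0:80"));

// Readers grab the current value; cheap and lock-free.
let current = config.load();
println!("using: {}", *current);

// An updater swaps in a new value; readers that already loaded the
// old one keep it alive through their own Arc.
config.store(Arc::new(String::from("listen = 0.0.0.0:8080")));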
There are a lot of little ugly technical details inside, mostly coming from the fact that without locking, it is hard to make sure the writing thread doesn't decrement the reference count to 0 and drop the value just before another thread tries to increment it and claim shared ownership. Doing this safely, entirely without locking, and fast at the same time is still a topic of academic research.
So while arc-swap is faster than RwLock and suffers much less from contention between the reader threads (there's almost no penalty for an arbitrary number of threads trying to read it at the same time), I'd like to make it even faster. The hot path still contains some atomic operations, one of them with SeqCst ordering. This is slow (relatively speaking). And while I try to keep an eye on research results that could be incorporated, this is unlikely to yield an order-of-magnitude speedup.
That speedup came when someone on Reddit recommended a nice trick (sorry, I don't remember their name ☹). Most of the time, the value read from the storage is the same as the one read last time. So each thread keeps its own private, fully loaded cached Arc. Instead of loading it from the shared storage each time, it only revalidates the cached value.
Instead of the whole synchronizing dance, it simply loads the value of the raw pointer from the shared storage with Relaxed ordering. That one is fast ‒ depending on the platform, it is either the very same load as an unsynchronized read of a variable (it might disable some compiler optimizations, but it doesn't incur expensive memory barriers or locked instructions) or something very close to that. It doesn't synchronize the pointed-to value in any way and doesn't protect the reference count. But that doesn't matter, because we use the pointer only to check that it is still the same as the one we have. If it is, we are fine using the value we already have. Only if it is different (the rare case) do we perform the whole loading operation and store the value for later.
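To sketch the idea in code (this illustrates the principle only, not arc-swap's actual internals; full_load stands in for the expensive, properly synchronized load and is elided):

use std::sync::Arc;
use std::sync::atomic::{AtomicPtr, Ordering};

// Stand-in for the carefully synchronized full load; elided here.
fn full_load<T>(_shared: &AtomicPtr<T>) -> Arc<T> {
    unimplemented!()
}

struct Cache<T> {
    cached_ptr: *const T, // raw pointer observed during the last full load
    cached: Arc<T>,       // the fully loaded Arc this thread keeps alive
}

impl<T> Cache<T> {
    fn load(&mut self, shared: &AtomicPtr<T>) -> &Arc<T> {
        // Cheap path: just a Relaxed load of the raw pointer.
        let current = shared.load(Ordering::Relaxed) as *const T;
        if current != self.cached_ptr {
            // Rare path: the value changed, so do the whole loading
            // operation and remember the result for next time.
            self.cached = full_load(shared);
            self.cached_ptr = Arc::as_ptr(&self.cached);
        }
        &self.cached
    }
}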
And this does bring the desired order-of-magnitude speedup. It comes with its own set of disadvantages, like having to keep the Cache around, so it's suitable for singleton variables but not for interlinked data structures, and the old value's drop might get delayed. Therefore, this is an add-on on top of the basic ArcSwap, not a replacement.
The speedup probably doesn't matter in practice. Even the cost of the full load is going to be dominated by the actual use of the data. But all that is certainly cool, and who wouldn't want their hot path to be guaranteed never to wait on any kind of lock whatsoever? It also sidesteps the debate about whether it's OK for an async application to use the usual blocking locks when they are held only for a short time, because this just doesn't lock at all.
Bumpalo herds
The above is not a new thing, but the following one is. First, though, a little bit of backstory.
Let's say we want to write an algorithm that needs to allocate a lot of small things and assemble some kind of data structure out of them. Then it plays with the structure for a while and eventually throws it all away. One such example could be parsing an input into an AST. The usual way is to just put each node into a Box. But that is slow, because allocating each node means talking to the global allocator. Furthermore, getting rid of the data structure means walking through all the nodes and talking to the global allocator again for each freed one. Don't get me wrong, I don't mean to suggest the authors of the allocator were sloppy. That piece of code is probably one of the most optimized things around. It is slow simply because its goal is complex.
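For a concrete picture, here's the usual Box-per-node shape (a made-up toy AST, just for illustration):

// Every node is a separate allocation from the global allocator.
enum Expr {
    Num(i64),
    Add(Box<Expr>, Box<Expr>),
}

let ast = Expr::Add(
    Box::new(Expr::Num(1)),
    Box::new(Expr::Add(Box::new(Expr::Num(2)), Box::new(Expr::Num(3)))),
);
// Dropping `ast` walks the whole tree and frees each node separately.
drop(ast);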
Using flat data structures (long Vecs instead of linked lists, or fields inline in bigger structs) is one of the ways Rust accomplishes its speed. But recursive ASTs are not really well suited for that, and that's where we can use an arena allocator (that's what Rust folks call the thing), bump allocator, or memory pool (that's what Apache calls it). The idea is that it allocates a big block of memory from the global allocator. Then, when the application needs a small chunk, it bites it off that big block, only moving a pointer in the process. There's no per-allocation bookkeeping to do, it's super fast, and as a bonus the memory is close together, giving better cache locality. And when it's time to get rid of the whole data structure, the big block is returned to the global allocator all at once, without the need to traverse all the links in there. The downside is that there's no way to free individual allocations, only everything at once (technically, it can be extended to support stages, deallocating the newest bunch of allocations and returning to an older state, but that's not relevant to our story).
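The core of the trick can be sketched in a few lines. This is a toy illustration of the pointer bumping, not how any real arena is implemented:

// The arena owns one big chunk; allocating just advances an offset.
struct ToyArena {
    chunk: Vec<u8>,
    offset: usize,
}

impl ToyArena {
    fn with_capacity(cap: usize) -> Self {
        ToyArena { chunk: vec![0; cap], offset: 0 }
    }

    // Bite `size` bytes off the big block; no per-allocation
    // bookkeeping (`align` must be a power of two).
    fn alloc(&mut self, size: usize, align: usize) -> Option<*mut u8> {
        let start = (self.offset + align - 1) & !(align - 1);
        let end = start.checked_add(size)?;
        if end > self.chunk.len() {
            return None; // a real arena would grab another chunk here
        }
        self.offset = end;
        Some(self.chunk[start..end].as_mut_ptr())
    }
}
// Dropping a ToyArena returns the whole chunk to the global allocator
// at once; individual allocations are never freed one by one.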
The one I'm fond of is called bumpalo. It needs only 11 instructions on the fast path to allocate. This gives us an order-of-magnitude speedup in a very non-scientific and synthetic benchmark (but it can really speed up your program if it's allocation bound; for example, rustc uses a similar thing in various places). This is a nice trick in itself, but not the one I want to write about.
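For reference, basic use of bumpalo looks like this:

use bumpalo::Bump;

let arena = Bump::new();
// Each alloc is just a pointer bump inside the arena's current chunk.
let node: &mut usize = arena.alloc(42);
*node += 1;
// Dropping the arena frees all its allocations at once.
drop(arena);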
Besides the limited ways of deallocating, there's another downside. The allocator is not thread safe. It really can't be made thread safe and still stay this fast, because even a single fetch_add would be a significant slowdown compared to those 11 non-atomic instructions, and we would likely need more than one fetch_add.
So we gained fast allocations, but lost the ability to parallelize our workload easily with rayon or crossbeam's scoped threads. Say we wanted to do something like this (let's assume the usizes are actually some complex nodes in our AST and we link them together):
let arena = Bump::new();
let data: Vec<&usize> = (0..10_000_000)
    .into_par_iter()
    .map(|n| arena.alloc(n))
    .collect();
dbg!(data);
The worker threads would be fighting over the one arena, which would be incorrect (data races on the allocation pointer and such) and slow on top of that (fighting over the cache line where it lives). Rust is right in refusing to compile this cursed code.
Rayon has the map_init method that lets us create a separate instance for each thread:
let data: Vec<&usize> = (0..10_000_000)
    .into_par_iter()
    .map_init(Bump::new, |arena, n| arena.alloc(n))
    .collect();
dbg!(data);
But that doesn’t work either, because the arenas with the memory blocks in them are dropped once the rayon task terminates and the references become dangling. Dang!
This "choose only one: arenas or threads" situation had been frustrating me long enough, so I decided to go and attack it head-first. And failed at the first several attempts.
Sharding
Obviously, using a Mutex<Bump> is not the right way, because the threads will contend on the mutex and make it an order of magnitude slower than using the global allocator (yes, I've tried).
A textbook way to overcome lock contention is to just have more locks and lower the chance of two threads meeting on the same one. We don't really care which arena a chunk of memory comes from.
So let’s have a data structure like this:
struct Herd {
    instances: Vec<Mutex<Bump>>,
}
During each allocation, it would pick one of them in some way and hope it would be free. If not, it would try the next one (in a circular manner). To lower the chance of collisions in the future, it would store the index in a thread-local variable and start at that one next time. The hope is that the threads would eventually each settle on their own unique index and their mutex would always be free, which is well known to be quite fast. Only when threads get rescheduled between CPUs or the like would they have to shuffle again.
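Fleshing out the struct above, the attempt might have looked something like this (my reconstruction for illustration, not the code I actually benchmarked):

use std::cell::Cell;
use std::sync::Mutex;
use bumpalo::Bump;

thread_local! {
    // The shard index this thread succeeded on last time.
    static LAST_SHARD: Cell<usize> = Cell::new(0);
}

impl Herd {
    // Start at the remembered shard, try to lock it, and walk the
    // circle until a free arena turns up.
    fn with_arena<R>(&self, f: impl FnOnce(&Bump) -> R) -> R {
        let mut i = LAST_SHARD.with(|s| s.get()) % self.instances.len();
        loop {
            if let Ok(arena) = self.instances[i].try_lock() {
                LAST_SHARD.with(|s| s.set(i));
                return f(&arena);
            }
            i = (i + 1) % self.instances.len();
        }
    }
}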
While the first assumption, about the threads finding unique indices fast enough, worked well, it still turned out to be slow. Any kind of read-modify-write atomic operation (like fetch_add or compare_exchange) will ruin our great speed of 11 fast instructions, remember? And what would you guess is hidden inside that Mutex thing? Even implementing bare-bones locking without all the mutex bells and whistles (like thread parking, as we simply move on to the next lock) won't help. So, are we completely out of luck?
Rescuing the thread instances
We can't afford to put the arena into thread-local storage (yes, that is slow too!), but the thread can certainly keep it in a local variable. The problem is that when the thread (or rayon task) terminates, the arena either needs to be dropped (which frees our precious data structure) or moved somewhere else (which, according to the borrow checker's rules, ends the lifetimes for which we can have the memory borrowed).
But there are ways to make sure some memory or references are not really invalidated in practice. If we put something into a Box, we can move the box around as we like; the heap allocation will still stay in one place. We could guess that the Bump thing contains something like a Box or Vec inside to hold the actual memory block, but let's not take any risks for now.
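A quick demonstration of that property (safe code, just checking addresses):

let boxed = Box::new(42u32);
let before = &*boxed as *const u32;

// Moving the Box moves only the pointer on the stack; the heap
// allocation it points to stays where it is.
let moved = boxed;
let after = &*moved as *const u32;

assert_eq!(before, after);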
So, if we take the Box<Bump> thing and move it out of the thread or task when it ends, all the stuff we allocated from it is technically still valid; only the borrow checker is too limited to see that. Therefore, if we sprinkle our algorithm with enough unsafe, it would be fine.
I don't know about you, but I don't like to see unsafe intertwined with business logic or complex algorithms. I can do complex algorithms. I can do unsafe. But please don't make me do both at once. So what I did is lock that really small bit of unsafe away.
I have a Herd (bumpalos are herd animals, right?). This one can lend out an arena ‒ a small wrapper around a Box<Bump>. Once that wrapper gets dropped, the boxed bumpalo gets returned to the herd. This way the memory block lives on, at the same address, so our references can be tied to the Herd, not to the lifetimes of the small arenas.
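In code, the shape of the trick is roughly this (a simplified sketch with hypothetical names, replacing the sharded version above; the real proof of concept needs more care):

use std::sync::Mutex;
use bumpalo::Bump;

// The whole point: confine the unsafe to this small wrapper.
struct Herd {
    // Arenas that are not currently lent out to any thread.
    idle: Mutex<Vec<Box<Bump>>>,
}

// The wrapper a thread borrows; it returns its arena on drop.
struct Member<'h> {
    herd: &'h Herd,
    arena: Option<Box<Bump>>,
}

impl Herd {
    fn new() -> Self {
        Herd { idle: Mutex::new(Vec::new()) }
    }

    fn get(&self) -> Member<'_> {
        let arena = self
            .idle
            .lock()
            .unwrap()
            .pop()
            .unwrap_or_else(|| Box::new(Bump::new()));
        Member { herd: self, arena: Some(arena) }
    }
}

impl<'h> Member<'h> {
    fn alloc<T>(&self, val: T) -> &'h mut T {
        let short: &mut T = self.arena.as_ref().unwrap().alloc(val);
        // The allocation lives inside the boxed Bump, which goes back
        // to the herd on drop instead of being destroyed, so it outlives
        // this Member. The borrow checker can't see that, hence the
        // lifetime extension to the herd's lifetime.
        unsafe { &mut *(short as *mut T) }
    }
}

impl<'h> Drop for Member<'h> {
    fn drop(&mut self) {
        // Return the boxed arena; its memory block stays at one address.
        let arena = self.arena.take().unwrap();
        self.herd.idle.lock().unwrap().push(arena);
    }
}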
And because each thread has its own independent arena for its lifetime, we sidestep the whole slow synchronization issue and stay at those really great 11 instructions (assuming rustc is able to see through the box and optimize the repeated indirection into a single one at the beginning).
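With the sketch above, the original rayon example could look like this (hypothetical usage; the real interface may differ):

use rayon::prelude::*;

let herd = Herd::new();
let data: Vec<&mut usize> = (0..10_000_000usize)
    .into_par_iter()
    .map_init(|| herd.get(), |member, n| member.alloc(n))
    .collect();
// The members were dropped and their arenas returned to the herd, but
// every reference in `data` still points to valid memory.
dbg!(data.len());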
If you want to have a look, I have a proof of concept. There's a lot of code still to be written, documented, tested, and proven correct ‒ for now I have implemented only one of the many allocation methods, just to try the idea out. But I intend to finish it and publish it one way or another eventually (currently I'm waiting on an answer from the bumpalo maintainers about whether they'd like it in their crate or whether I should keep it as a separate crate of my own).
And here are the results of the unscientific benchmark on my computer (some oldish 8-core AMD Bulldozer thing; your mileage will vary). The benchmarks construct some long singly linked lists:
test alloc_directly ... bench: 58,620,380 ns/iter (+/- 3,793,463)
test alloc_multi ... bench: 20,452,760 ns/iter (+/- 3,603,708)
test herd_multi ... bench: 938,045 ns/iter (+/- 533,165)
test herd_single ... bench: 3,284,031 ns/iter (+/- 395,678)
test locked ... bench: 318,254,662 ns/iter (+/- 13,216,651)
test single_threaded ... bench: 3,492,762 ns/iter (+/- 236,510)
The alloc ones are directly using Box. The locked one is Mutex<Bump>, single_threaded is vanilla Bump on one thread. The herd ones are the thing described above (both in single and multiple threads).
Other neat performance tricks
As I mentioned above, there are a lot of these little tricks around that can really help. For example, 3D graphics is full of them (have you ever tried to render a sphere with only 8 triangles? Or with one square and shaders?). I'll be happy to hear your favourites. Computers are not getting that much faster any more, but clever code that can be reused could help make software run faster.