More tricks up ArcSwap’s sleeve
This is a continuation of the Making `Arc` more atomic post. In short, `ArcSwap` is a place where you can atomically store and load an `Arc`, similar to `RwLock<Arc<T>>` but without the locking. It’s a good tool if you have some data that is very frequently read but infrequently modified, like configuration or an in-memory database that answers millions of queries per second but is replaced only every 5 minutes. The canonical example for this is routing tables ‒ you want to read them with every passing packet, but you change them only when the routing changes.
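
In code, the basic usage looks something like this (a minimal sketch; the method names match the text, but exact signatures may differ between crate versions):

```rust
use std::sync::Arc;

use arc_swap::ArcSwap;

struct Config {
    timeout_ms: u64,
}

fn main() {
    // Readers and writers share this one storage.
    let shared = ArcSwap::from(Arc::new(Config { timeout_ms: 100 }));

    // Reader side: take a snapshot; it keeps the data alive even if a
    // writer replaces the value in the meantime.
    let snapshot = shared.load();
    println!("timeout: {} ms", snapshot.timeout_ms);

    // Writer side: atomically replace the whole configuration.
    shared.store(Arc::new(Config { timeout_ms: 200 }));
}
```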
I’ve been asked a few times how it works inside, and I don’t want to repeat myself too much, so here we go. This is a high-level overview of what happens under the hood; you’ll have to read the code for the fine details, proofs of correctness, etc.
I’ll be talking mostly about `Arc`, but the thing is generic and can be made to work with other things ‒ notably, `Option<Arc<T>>` is supported by the library, and if you have your own version of `Arc`, you can implement a trait on it and there you go too.
The devil is in the details
At the heart of the thing lies an `AtomicPtr`. As an `Arc` can be turned into a raw pointer and created from one, it can be stored in there. That conversion keeps the reference counts. The `swap` method is therefore easy ‒ one pointer with its reference goes in, another comes out. Just make sure the atomic orderings are correct. `store` is just a `swap` that throws the result away. That’s easy.
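
Something along these lines, as a minimal sketch (the real implementation handles much more, but the `into_raw`/`from_raw` round-trip is the core of it):

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicPtr, Ordering};

// Not the crate’s real code ‒ just the core idea.
struct NaiveStorage<T> {
    ptr: AtomicPtr<T>,
}

impl<T> NaiveStorage<T> {
    fn new(value: Arc<T>) -> Self {
        // into_raw keeps the reference count; the storage now holds one
        // „virtual“ reference.
        NaiveStorage {
            ptr: AtomicPtr::new(Arc::into_raw(value) as *mut T),
        }
    }

    fn swap(&self, new: Arc<T>) -> Arc<T> {
        // One pointer with its reference goes in, another comes out.
        let old = self.ptr.swap(Arc::into_raw(new) as *mut T, Ordering::AcqRel);
        // Safety: `old` came from Arc::into_raw and we take over its reference.
        unsafe { Arc::from_raw(old) }
    }
}
```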
What does some kind of `load` look like? It needs to read the pointer, and because it returns one `Arc` while keeping another „virtual“ `Arc` inside the storage, it needs to increment the reference count. But here lies the problem: by the time we get to incrementing it, some other thread might have swapped the pointer out of the storage, dropped its copy, reached a zero refcount and deleted the data, including the reference count itself. Ouch. Not good. We can’t increment a non-existent reference count.
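
To illustrate, here is what a naive `load` on the `NaiveStorage` sketch above would look like ‒ deliberately broken code, shown only to make the race visible:

```rust
use std::mem;
use std::sync::Arc;
use std::sync::atomic::Ordering;

// DELIBERATELY BROKEN ‒ do not use. Builds on the NaiveStorage sketch.
fn load_broken<T>(storage: &NaiveStorage<T>) -> Arc<T> {
    let ptr = storage.ptr.load(Ordering::Acquire);
    // !!! Between the load above and the clone below, another thread may
    // swap the pointer out, drop the last reference and free everything,
    // reference count included.
    let borrowed = unsafe { Arc::from_raw(ptr) }; // take over the storage’s reference
    let result = Arc::clone(&borrowed);           // increment the (maybe freed) count
    mem::forget(borrowed);                        // give the reference back to the storage
    result
}
```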
So, the original solution was a kind of uni-directional short-term lock which prevented the writer (the one doing `swap`) from completing before we increment the refcount. Read the original article for the details; it is all about that lock.
Properties of the lock
The original one (now called the generation lock) had some good properties. Specifically, the read operation was wait-free, which is great. Literally, a reader stopped for nobody, performed its fixed sequence of operations and continued with a copy of the `Arc` in its hand.
There were just three downsides:
- While the lock is able to deal with a constant stream of readers and a writer will finish even in the face of that, if a reader gets stuck in the critical section, all writers wait forever.
- Even though the read operation is wait-free, all readers write into the same cache line ‒ specifically, the lock and the reference count. So if there are many concurrent reads, they slow each other down because the CPU cores fight over which of them gets to write into the cache line (and here we are talking about scales where fetching data from RAM is considered an eternity).
- The storage for the pointer was bigger than the size of a pointer, inflating any data structures built out of these.
Sharding the lock
For this library, read performance is much more important than write performance. Therefore, the first change I made was to use multiple instances of the lock. Each reader picks just one of them and locks it, while a writer has to wait for permission from all of them. By making sure each lock is aligned to a multiple of the cache line size, the chance of two reader threads colliding on the same cache line is lowered (while making things slower for writers).
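
The idea, sketched with made-up names (the real generation lock is more refined ‒ in particular, writers don’t starve under a constant stream of readers, which this sketch does not guarantee):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

// Each shard gets its own cache line, so readers on different shards
// don’t fight over cache-line ownership.
#[repr(align(64))] // a common cache line size
struct Shard {
    readers: AtomicUsize,
}

const SHARDS: usize = 8;

struct ShardedLock {
    shards: [Shard; SHARDS],
}

impl ShardedLock {
    // A reader announces itself on a single shard only.
    fn read_lock(&self, thread_hint: usize) -> &Shard {
        let shard = &self.shards[thread_hint % SHARDS];
        shard.readers.fetch_add(1, Ordering::SeqCst);
        shard
    }

    fn read_unlock(shard: &Shard) {
        shard.readers.fetch_sub(1, Ordering::SeqCst);
    }

    // A writer needs permission from *all* shards ‒ the price paid for
    // the cheap reads.
    fn write_wait(&self) {
        for shard in &self.shards {
            while shard.readers.load(Ordering::SeqCst) != 0 {
                thread::yield_now();
            }
        }
    }
}
```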
However, this doesn’t solve the contention on the reference count. If the only thing I want to do is read one integer from the configuration behind that pointer, incrementing and immediately afterwards decrementing the shared reference count is a significant cost.
So I’ve made another method, `peek`, available. Instead of locking, incrementing the reference count and unlocking, it locks and returns the guard with access to the data. The caller reads that `i64` from there and unlocks, never getting an actual `Arc` created (but keeping the data alive for the whole time) ‒ and never touching the reference count.
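
Usage-wise it looks something like this (a sketch; the exact guard types depend on the crate version):

```rust
use arc_swap::ArcSwap;

struct Config {
    limit: i64,
}

fn current_limit(shared: &ArcSwap<Config>) -> i64 {
    // The guard borrows the stored data directly; no Arc is created and
    // the shared reference count is never touched.
    let guard = shared.peek();
    // Keep this short ‒ a writer may be spinning while the guard lives.
    guard.limit
}
```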
This solves the problem for very short reads, but runs the risk of making the writer wait in a busy loop for longer if my `peek` into the data is not so short. I could still use `load`, which creates the `Arc` with its own reference, but that feels wrong.
Making the lock configurable
By default the storage uses a global set of locks. The idea is that if a thread is busy locking one `ArcSwap`, it won’t be trying to lock a different one at the same time. That way, `ArcSwap` itself has the same size as a pointer, but runs the risk of blocking writers on unrelated instances of `ArcSwap`.
It is possible to fine-tune the behaviour with a type parameter. The library contains an implementation of private locks, one for each separate instance. This is even used inside the signal-hook crate. There, this lock is the only allowed possibility due to restrictions on what can be done inside a signal handler. However, the signal handler may run for a longish time and it would not do to block unrelated processing. With a private lock instance, only adding or removing signal handlers is affected (which usually happens just at application startup).
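
In code, choosing a private lock looks roughly like this ‒ treat the exact alias and parameter names as version-dependent placeholders and check the crate docs:

```rust
use std::sync::Arc;

// The per-instance-lock flavour; the name may differ between versions.
use arc_swap::IndependentArcSwap;

fn main() {
    // This instance carries its own private lock, so even a long-held
    // guard here cannot block writers of unrelated ArcSwaps that use
    // the global set of locks.
    let local = IndependentArcSwap::from(Arc::new(42u64));
    let value: u64 = **local.peek();
    println!("{}", value);
}
```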
What about hazard pointers
As you can see, the lock approach isn’t perfect.
I’m certainly not the only person to take an interest in a similar problem. I’ve come across something called hazard pointers, and one specific implementation, the `AtomicArc`, was certainly a very interesting read. Unfortunately, as far as I know, it’s not yet finished and released.
On a very, very high level, we have a global „don’t you dare to free this, I’m using it right now“ registry. If I want to start doing something with a pointer, I read it, write it down into the registry and check it by reading it once more (to avoid races). Only after this can I actually access the memory. When done, I remove it from the registry.
If I want to free a pointer, I first remove it from the storage (so nobody else can read it any more), then walk the whole registry. If I don’t see the pointer mentioned there, fine, I can safely free it. If it is there, I mark it in the registry with a note „you’re now responsible for freeing it“, passing the problem onto some other unfortunate thread. That thread in turn will try to free it (with the same check through the whole registry) when removing its entry from the registry.
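
A toy version of the reading side, with a registry of a single slot and no freeing logic, might look like this:

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

// A toy registry with one slot; a real one has many slots and a way
// to claim them.
static SLOT: AtomicPtr<u64> = AtomicPtr::new(ptr::null_mut());

fn hazard_read(shared: &AtomicPtr<u64>) -> *mut u64 {
    loop {
        // 1. Read the pointer.
        let p = shared.load(Ordering::Acquire);
        // 2. Write it down into the registry: „don’t you dare to free
        //    this, I’m using it right now“.
        SLOT.store(p, Ordering::SeqCst);
        // 3. Check by reading once more. If it changed in between, the
        //    value we registered may already be freed, so retry.
        if shared.load(Ordering::SeqCst) == p {
            // Only now may the memory behind `p` be accessed.
            return p;
        }
    }
}

// When done, remove the pointer from the registry.
fn hazard_done() {
    SLOT.store(ptr::null_mut(), Ordering::SeqCst);
}
```

Note the retry loop ‒ this is what makes the reading side only lock-free, not wait-free.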
This is all very nice and shiny, but I didn’t like two details about it:
- Searching for a free spot in the registry can take a long time. Additionally, a reader thread may become responsible for freeing the pointer, forcing it to walk the registry. I want my readers to be fast (and wait-free, not just lock-free). Like, always, not just most of the time. Not for any actual reason; it is simply a matter of personal pride.
- While the idea is relatively simple, code that handles all the corner cases in the registry in the face of concurrent lock-free access (it wouldn’t make much sense to put locks into the registry, would it?) turns out to be quite tangled. Combined with the general inability to write unit tests for these corner cases (they happen as a result of thread scheduling, which CPU is faster, etc.), this feels like a dangerous path to take. I do enjoy hard problems and such, but there are certain limits, especially if I want to claim some confidence in the correctness of the thing. Furthermore, all the „cleverness“ of the code slows it down visibly in benchmarks.
The hybrid approach
This is the implementation of the third loading method, `lease`. This one should probably be considered the primary one for most use cases.
Eventually I decided to trim the hazard pointers down to a bare minimum. If anything at all doesn’t go according to the best-case scenario, it falls back to a full `load` instead. That one is still formally wait-free, and by making the fallback rare, the problem with contention on the reference count should be effectively gone.
Furthermore, the responsibility for freeing something is not passed through the registry. A record in the registry means there’s an `Arc` somewhere which owes one reference (which is fine as long as we don’t reach 0). For this reason, they are no longer called hazard pointers in the code; they are debts. So instead of passing on the responsibility to walk the registry, the writer takes the debts onto itself (removes them from the registry) and pays them (by incrementing the reference count). Actually, it pre-pays one upfront and removes it at the end, to be on the safe side.
The read operation therefore tries this, in order (a condensed sketch in code follows the list):
- If this is the first access from the current thread, set up a fixed number of slots in the registry. They are going to be for the exclusive use of my thread (well, the code is ready for potential sharing of the slots between threads, but no kind of slot stealing is implemented and it’s probably not worth it). Only this part is merely lock-free, because the new slots need to be linked into a linked list and there may be retries doing that. Future reads from this thread will be wait-free.
- Read the pointer.
- Search through my fixed number of slots for an empty one. If none is found, fall back to a full load. If one is found, register the debt.
- Re-read the pointer. If it’s the same, we’re done.
- If it is different, try to remove the debt from the registry. If it was already paid by someone else, we have a full `Arc` in our hands, which is great and we are done. If not, fall back to a full load.
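
Condensed into code, with made-up names and the per-thread bookkeeping stripped away, the sequence looks roughly like this:

```rust
use std::sync::atomic::{AtomicPtr, AtomicUsize, Ordering};

// Made-up names; the real code keeps groups of slots per thread in a
// lock-free linked list.
const NO_DEBT: usize = 0;

struct DebtSlot {
    // The pointer value we owe one reference for, or NO_DEBT.
    owed: AtomicUsize,
}

enum LeaseOutcome<T> {
    // Debt registered ‒ the guard owes one reference for this pointer.
    Debt(*mut T),
    // A writer already paid the debt ‒ we hold a full reference.
    Paid(*mut T),
    // Couldn’t lease safely; the caller performs a full load instead.
    FallBack,
}

fn lease<T>(storage: &AtomicPtr<T>, slot: &DebtSlot) -> LeaseOutcome<T> {
    // Read the pointer.
    let ptr = storage.load(Ordering::Acquire);
    // Register the debt (a real implementation first checks the slot is
    // actually free and falls back if it isn’t).
    slot.owed.store(ptr as usize, Ordering::SeqCst);
    // Re-read the pointer. If it’s the same, any writer freeing this
    // pointer is guaranteed to see (and pay) our debt first.
    if storage.load(Ordering::SeqCst) == ptr {
        return LeaseOutcome::Debt(ptr);
    }
    // The pointer changed under us ‒ try to take the debt back.
    match slot.owed.compare_exchange(
        ptr as usize,
        NO_DEBT,
        Ordering::SeqCst,
        Ordering::SeqCst,
    ) {
        // Cancelled in time; nobody owes anything, do a full load.
        Ok(_) => LeaseOutcome::FallBack,
        // Too late ‒ a writer already took over and paid the debt, so
        // we now hold a full reference.
        Err(_) => LeaseOutcome::Paid(ptr),
    }
}
```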
Just so people reading through the code don’t get confused: the `Lease` guard does not store the full `Arc` as an `Arc`, only as the raw pointer, and that `Arc` is only virtual (i.e. there’s only the reference count for it).
When dropping the guard, we check if we owe a reference. If we do, we actually have nothing to do (except for removing the debt from the registry) ‒ we simply make things even by ceasing to exist. If we don’t (either because we’ve done a full load or because someone paid our debt), we decrement the reference count.
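
Continuing the sketch from above (the same made-up `DebtSlot` and `NO_DEBT`), the guard’s drop could look like this:

```rust
use std::sync::Arc;
use std::sync::atomic::Ordering;

struct Lease<T> {
    ptr: *mut T,
    // Some(slot) while a reference is still owed; None after a full
    // load or a paid debt. (&'static is a simplification, though the
    // real slots do live for the rest of the program.)
    debt: Option<&'static DebtSlot>,
}

impl<T> Drop for Lease<T> {
    fn drop(&mut self) {
        if let Some(slot) = self.debt.take() {
            // Try to erase the debt. If that succeeds, we held no real
            // reference ‒ disappearing makes things even.
            if slot
                .owed
                .compare_exchange(
                    self.ptr as usize,
                    NO_DEBT,
                    Ordering::SeqCst,
                    Ordering::SeqCst,
                )
                .is_ok()
            {
                return;
            }
            // A writer paid the debt in the meantime ‒ fall through and
            // return the reference we suddenly own.
        }
        // Full load or paid debt: decrement the reference count.
        unsafe { drop(Arc::from_raw(self.ptr)) }
    }
}
```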
As long as the current thread doesn’t hold more leases than there are slots, this is pretty fast ‒ comparable to a lock & unlock of an uncontended `Mutex` (mutexes are slow only when contended, because then threads have to wait in line and go to the OS to be put to sleep…), but with the ability to serve an arbitrary number of concurrent readers ‒ no risk of being locked out or slowed down.
Benchmarks
The repository contains some benchmarks. The ones I consider the most relevant are doing the measured operation (e.g. `peek`) while some other threads generate „background noise“ traffic ‒ like writing new values all the time. There are many combinations of the measured operation, how many threads produce the background traffic and what kind of background traffic it is. This is done with several alternatives to compare `ArcSwap` against.
However, benchmarks are not the real thing, for many reasons. First and foremost, real applications do other stuff than just access one shared piece of data, so the differences might be hidden in the costs of whatever else the application does. Apart from that, it turns out CPUs are sensitive to almost anything. Like, spinning up a parallel thread that doesn’t access anything near our data may nevertheless slow our core down, because the temperature gets higher or because the power subsystem is not able to feed both. So while interesting, benchmarks are not to be taken too seriously.
Areas of improvement
I’m open to ideas about what to improve and how. In particular, I believe there is room for improvement in the writer performance (that one is rather slow) ‒ but I want to do that only if reader performance doesn’t get hurt. This crate is biased towards readers; if you need something „balanced“, there’s a place for some other implementation.
Also, discovering places where the documentation is not perfect (and sending a PR) would be highly appreciated. Or coming up with a nicer API.
Reading the proofs and the code and double-checking that my reasoning is sound and that I didn’t overlook something would also be very valuable.
Implementation for other languages
I’ve already been asked if I would like to port this to another language (specifically, C++). My answer to that is no, I wouldn’t. This code is already quite challenging even with all the help and checks I get from the Rust compiler (which is a lot, even if the code uses `unsafe` extensively). And I don’t have the free time for that. Besides, I don’t think a C++ `shared_ptr` can be turned into a single pointer.
However, if you want to go ahead and implement it in some other language, I’ll be more than happy to answer any questions and walk you through the design or just talk about it.