Atomics ☢ and memory ordering
Taming multiple threads is a mess. Not only can many things happen all at once, but what you wrote in the code isn’t exactly what happens in the CPU. To gain some more performance, the compiler cheats if it thinks nobody is watching. It can reorder instructions or throw some of them out if they look useless. The same happens in the hardware. Furthermore, there isn’t just one RAM: each memory location can live in several different caches at a time, some of them private to each CPU core. It would be too slow to publish every local change in one’s cache right away.
This, if not handled with care (eg. by synchronization), leads to the dreaded undefined behaviour.
Besides the classical synchronization primitives (mutexes, barriers, …), some languages offer atomic types (C++ and Rust amongst others). They are cool, because they allow implementing the classical primitives and some really fast multithreaded data structures (like the ones in crossbeam). Unfortunately, they are much harder to use than just placing something behind a mutex, partly because many of their methods have this magical Ordering parameter which influences what they synchronize with and how.
Most documentation explains it in terms of guarantees, causalities and happens-before relations. These are good for a formal specification, but in practice it feels like teaching set theory to an accountant ‒ not really what is needed. The closest thing to an easy-to-grasp explanation I could find is in the nomicon, but I decided to give it a go myself. This’ll be more intuitive; I just hope it is not at the cost of correctness.
What is an atomic anyway?
Atomic types don’t differentiate themselves by the memory they live in ‒ it’s ordinary RAM as with any other type. They don’t take more or less memory than the non-atomic variant, their bit representations are identical. Only, the compiler generates different instructions to handle them (and avoids some optimisations), which forces the processor to do things without cheating. In theory one could use the same variable sometimes atomically and sometimes not, but it would make little sense in practice.
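To see the first claim concretely, here’s a trivial sketch checking that the atomic and the plain type really do have the same size (the standard library documents that AtomicUsize has the same in-memory representation as usize):

use std::mem::size_of;
use std::sync::atomic::AtomicUsize;

fn main() {
    // Same size, same bit representation ‒ only the generated instructions differ.
    assert_eq!(size_of::<AtomicUsize>(), size_of::<usize>());
}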
However, because both the compiler and the processor want to cheat (and you want them to cheat, to get a faster program), the ordering parameter says how much cheating is allowed with each operation. This is one of the reasons atomics are usually faster than placing mutexes everywhere ‒ they control the cheating in much finer detail (the other reason is that a mutex enters kernel space to suspend the thread if it has to wait for another thread to unlock it, which is expensive).
Some guides recommend using the Sequentially Consistent ordering if you are not sure. I don’t think this is really good advice ‒ and not because of its effect on performance. If you don’t know which ordering you can afford to use, you also don’t know whether atomics are enough at all or whether you have to go with a mutex or some other stronger tool. The mutex is the safe option, and once you’ve managed to prove that an atomic is enough, you probably already know which ordering you need to use with it.
An important part of understanding the orderings is that they don’t define how the atomic itself acts. They influence only the synchronization of other memory locations ‒ how much other synchronization piggy-backs on that operation. Obviously, less is faster.
Relaxed order ‒ sanity just for me
All atomics are sane in the sense that accessing them from multiple threads at once will not produce undefined behaviour and that the atomic itself will act as you’d expect ‒ if you start at 0 and increment it 10 times by 1, you get the numbers from 0 to 10, each one exactly once, and the value at the end will be 10. A later increment in the same thread will result in a larger number. All threads see the operations on one atomic in a single common timeline (eg. if the atomic reaches 10 and my thread saw it, it will not go back to 5 later on unless someone decreases it again).
use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

fn main() {
    let a = Arc::new(AtomicUsize::new(0));
    let mut threads = Vec::new();
    for _ in 0..5 {
        // Each thread gets its own handle to the shared counter.
        let a = Arc::clone(&a);
        threads.push(thread::spawn(move || {
            // fetch_add returns the previous value; within one thread,
            // the later increment must observe a larger one.
            let first = a.fetch_add(1, Ordering::Relaxed);
            let second = a.fetch_add(1, Ordering::Relaxed);
            assert!(first < second);
        }));
    }
    for t in threads {
        t.join().unwrap();
    }
    assert_eq!(10, a.load(Ordering::Relaxed));
}
Apart from this, the relaxed order doesn’t promise anything else. Not even the order of operations in relation to other atomics (eg. when incrementing atomics A and B, it can happen that one thread sees them incremented in this order while another thread sees them in the reverse order). Yes, different threads may see the operations in different orders (don’t worry about it too much if it feels a bit schizophrenic). You can get some pretty weird behaviours, but only with multiple variables (atomic or not) in play.
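A small sketch of the kind of outcome this permits (a hypothetical example; the reverse observation is rare in practice, but allowed):

use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

static A: AtomicUsize = AtomicUsize::new(0);
static B: AtomicUsize = AtomicUsize::new(0);

fn main() {
    let writer = thread::spawn(|| {
        A.store(1, Ordering::Relaxed); // first in the code…
        B.store(1, Ordering::Relaxed); // …second in the code
    });
    let reader = thread::spawn(|| {
        let b = B.load(Ordering::Relaxed);
        let a = A.load(Ordering::Relaxed);
        // Relaxed permits seeing b == 1 while a == 0 ‒ the stores may
        // appear to this thread in the reverse order.
        (a, b)
    });
    writer.join().unwrap();
    reader.join().unwrap();
}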
What good is it?
- If you need to generate unique IDs (at least until you run out of distinct numbers and wrap around).
- If you need some information (eg. „please stop“) to reach another thread ‒ see the sketch after this list.
- If you count something and need the exact number at the end of the program.
- For performance metrics, because you don’t care about little inconsistencies in their relative counts (you can get a few more answers than requests at the time you read them, because the answer atomic was faster in sending its updates to the reader thread).
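As a sketch of the „please stop“ case (names made up for illustration), a relaxed flag is enough, because only the single bit itself has to reach the other thread ‒ no other memory is handed over with it:

use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;
use std::time::Duration;

fn main() {
    let stop = Arc::new(AtomicBool::new(false));
    let worker = {
        let stop = Arc::clone(&stop);
        thread::spawn(move || {
            while !stop.load(Ordering::Relaxed) {
                // … do one unit of work …
                thread::sleep(Duration::from_millis(10));
            }
        })
    };
    thread::sleep(Duration::from_millis(100));
    stop.store(true, Ordering::Relaxed); // eventually observed by the worker
    worker.join().unwrap();
}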
Not all architectures know this order. If they don’t, they fall back to something stronger.
Release & Acquire ‒ sending data from one thread to another
In addition to what the relaxed order provides, this one makes a contract between two threads. If one of them does a release operation on an atomic and another one then does an acquire operation on the same atomic, all the data the first thread has written before the release becomes visible on the second thread after the acquire. The pair of operations is a rendezvous point to hand over memory.
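A minimal sketch of that contract (the struct and field names are made up for illustration; the UnsafeCell holds ordinary non-atomic data and the flag is the rendezvous point):

use std::cell::UnsafeCell;
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

struct Handover {
    data: UnsafeCell<u64>, // ordinary, non-atomic memory
    ready: AtomicBool,     // the rendezvous point
}

// SAFETY: all access to `data` is synchronized through `ready` below.
unsafe impl Sync for Handover {}

fn main() {
    let h = Arc::new(Handover {
        data: UnsafeCell::new(0),
        ready: AtomicBool::new(false),
    });
    let producer = {
        let h = Arc::clone(&h);
        thread::spawn(move || {
            unsafe { *h.data.get() = 42 }; // written *before* the release
            h.ready.store(true, Ordering::Release); // publish it
        })
    };
    // Spin until the flag is up; the acquire pairs with the release above.
    while !h.ready.load(Ordering::Acquire) {}
    // The write to `data` is now guaranteed to be visible here.
    assert_eq!(unsafe { *h.data.get() }, 42);
    producer.join().unwrap();
}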
What good is it?
- Writing a mutex, spinlock, channel or some other interesting data structure ‒ the pair of operations ensure that the original thread „publishes“ all its changes from the cache to RAM and the other one „downloads“ all the relevant memory (this is a simplification of what actually happens, there’s something called cache coherence in play so it doesn’t have to go all the way to RAM, but it’s a good enough mental model).
- It allows creating bi-directional synchronization (a mutex first acquires the lock in the form of an atomic acquire, gets the current memory, and then releases it, publishing its changes) or one-directional (a writer end of a channel could get away with just release operations, while acquire ones would be enough for the reader to get what the writer published, saving bandwidth by not sending changes the other way around).
use std::sync::atomic::{AtomicBool, Ordering};

let spinlock = AtomicBool::new(false); // not locked
// …
// Spin until we flip it false → true (compare_exchange_weak replaces the deprecated compare_and_swap).
while spinlock.compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed).is_err() {}
// It's locked here
spinlock.store(false, Ordering::Release);
A few important things of note:
- There’s no synchronization with any other thread guaranteed.
- There’s no guarantee about the memory the first thread changes after the release operation.
- No relation is enforced if the operations happen on different atomics.
- To correctly synchronize some other memory, the first thread must have finished writing into the memory by the time it does the release operation. It „releases“ its ownership of the memory.
- It works only if the release operation is a write (store) and the acquire operation is a read (load) of the atomic. Using them in reverse is illegal, as the sketch below shows.
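In Rust the illegal combinations are rejected at runtime (the standard library documents that load panics on Release and store panics on Acquire); a quick sketch:

use std::sync::atomic::{AtomicUsize, Ordering};

fn main() {
    let a = AtomicUsize::new(0);
    a.store(1, Ordering::Release); // legal: release on a write
    let _ = a.load(Ordering::Acquire); // legal: acquire on a read
    // The reverse combinations panic at runtime:
    // a.store(1, Ordering::Acquire);
    // let _ = a.load(Ordering::Release);
}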
There’s also the AcqRel ordering, which is a hybrid one. It acts like both acquire and release with an operation that is load-store (like fetch_add), allowing more complex patterns like exchanging memory on a Barrier. It is illegal on load-only or store-only operations.
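A minimal sketch of such a pattern (a hypothetical helper: a one-shot meeting point for exactly two threads):

use std::sync::atomic::{AtomicUsize, Ordering};

static ARRIVED: AtomicUsize = AtomicUsize::new(0);

// Each of the two threads calls meet(); after it returns, each side is
// guaranteed to see everything the other wrote before calling it.
fn meet() {
    // fetch_add is a load-store operation, so AcqRel is legal here: the
    // release half publishes our writes, the acquire half picks up the
    // other thread's if it arrived first.
    ARRIVED.fetch_add(1, Ordering::AcqRel);
    // If we were first, wait; the acquire load pairs with the other
    // thread's fetch_add.
    while ARRIVED.load(Ordering::Acquire) < 2 {}
}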
Sequentially consistent
This is just like AcqRel (or Acquire on load and Release on store), but in addition to that, it also synchronizes the order of operations with other atomics, or more correctly, with all SeqCst operations across the whole program. All threads have a single „timeline“ of all the SeqCst operations on all the atomics. It is still possible for the weaker operations to move around as they wish (well, unless they happen on an atomic that also did a SeqCst operation ‒ such a weaker operation must stay on the same side of that SeqCst operation, because it is on the same atomic, but it still can move around SeqCsts of different atomics).
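The single timeline can be illustrated with the classic „store buffering“ litmus test ‒ a sketch (with SeqCst, at least one thread must see the other’s store; weaker orderings allow both loads to come out false):

use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

static X: AtomicBool = AtomicBool::new(false);
static Y: AtomicBool = AtomicBool::new(false);

fn main() {
    let t1 = thread::spawn(|| {
        X.store(true, Ordering::SeqCst);
        Y.load(Ordering::SeqCst)
    });
    let t2 = thread::spawn(|| {
        Y.store(true, Ordering::SeqCst);
        X.load(Ordering::SeqCst)
    });
    let (r1, r2) = (t1.join().unwrap(), t2.join().unwrap());
    // With a single timeline, one of the stores happened first, so the
    // other thread's load must have seen it.
    assert!(r1 || r2);
}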
What good is it? Honestly, I haven’t found a use case yet that couldn’t be solved by something weaker. I’d be glad if you could provide one.
Other synchronization points
Just like a pair of AcqRel operations on the same atomic is a place for two threads to meet and exchange their news about what happened in the global RAM, there are other such events. There’s likely some atomic operation hidden inside them, so in that sense everything could be just this, but on the outside, these must also enforce the propagation of news for the thread/memory model to make any kind of sense:
- Locking/unlocking a mutex, meeting at a barrier.
- Sending data through a channel (see the sketch after this list).
- Spawning or joining another thread.
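For instance (a sketch; the channel stands in for any of the points above), a relaxed store made before a send is guaranteed to be visible after the corresponding receive:

use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc;
use std::thread;

static COUNTER: AtomicUsize = AtomicUsize::new(0);

fn main() {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        COUNTER.store(42, Ordering::Relaxed); // written before the send…
        tx.send(()).unwrap(); // …which publishes it
    });
    rx.recv().unwrap(); // the receive picks the news up
    assert_eq!(COUNTER.load(Ordering::Relaxed), 42);
}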
Summary
- Relaxed: Just relax and don’t worry about details like the order of things or other memory.
- Release: Make a press release about what changes you made to the memory.
- Acquire: Get hold of all the news.
- SeqCst: Make sure everyone involved is consistent about what happened when during cross-examination.