Don’t panic
Rust offers guarantees about memory safety and data race freedom. But it doesn’t
stop there. It nudges the programmer in the general direction of good habits.
For example, the Result type is designed so that it is easier to handle the error
than to ignore it, and even ignoring it must be a conscious choice. This helps build
software that is robust and has a lower concentration of bugs.
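As a small illustration (a minimal sketch; the file names are made up), the compiler warns if a Result is silently dropped, and throwing it away has to be written out explicitly:

use std::fs;

fn main() {
    // Handling the error is the path of least resistance…
    match fs::read_to_string("config.toml") {
        Ok(cfg) => println!("loaded {} bytes of config", cfg.len()),
        Err(e) => eprintln!("could not read config: {}", e),
    }

    // …while ignoring it must be a conscious, visible choice. Without the
    // `let _ =`, the unused Result triggers a compiler warning.
    let _ = fs::remove_file("stale.lock");
}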
Designing an API in this way, so that the path of least resistance is the correct one, is often called „The Pit of Success“. In its most basic form, this is about choosing the right default behaviours. No matter what your use case is, the default choice doesn’t have to be the best possible one, but it must not be outright wrong. If you don’t read the documentation, you should stumble upon something that doesn’t have hidden gotchas. Something that doesn’t compile is fine, but something that seems to work in 99% of cases but kills a kitten in the one remaining percent is not.
And there’s one particular area where I believe Rust falls pretty short of that. I can guess at the reasons why it ended up as it is, which I actually do at the end of this article.
The area is handling situations the program doesn’t know how to handle ‒ panicking.
What is that panic thing anyway?
Rust comes with different ways to handle error conditions. At its simplest, it divides errors into two groups (technically into three, but the third one is for things where you no longer even have a viable definition of sanity, so the program simply gives up and I’m not going to talk about that).
The first one is for things the program knows how to cope with. The programmer
foresaw that you could lack disk space, or that the program could be given
invalid input by the user. In the ideal world where the programmer writes only
correct programs, this includes everything the program encounters during its
lifetime. Rust handles these errors mostly through the Result
type.
The other kind is for things the program doesn’t know how to handle. That, in some definition of correct programs, means bugs. Either something inside the program broke, or the programmer forgot to take some real situation into account.
Admitting that all code contains bugs, and that it therefore makes sense to have special language support for handling bugs, is definitely a very enlightened approach.
Rust has the concept of panic to handle the bugs. Every time you index out of the
bounds of an array, an assert fails, or an unreachable!() piece of code is executed,
a panic happens.
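For instance (a toy example), each of these is one broken assumption away from a panic:

fn main() {
    let v = vec![1, 2, 3];

    // Indexing past the end of the vector panics at runtime.
    // let oops = v[10];

    // A failed assert panics.
    assert!(v.len() == 3, "expected exactly three elements");

    // And unreachable!() panics if the "impossible" branch is ever taken.
    match v.len() {
        0..=3 => println!("small vector"),
        _ => unreachable!("we never push more than three elements"),
    }
}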
By default in such a case the runtime starts unwinding the stack (removing
functions one by one) of the current thread, calling destructors of things on
the stack, until catch_unwind
is reached or until the whole stack is
unwound, in which case the thread terminates. However, if it was not the main
thread, the show goes on.
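A minimal demonstration of that default behaviour (nothing more than a sketch):

use std::panic;
use std::thread;

fn main() {
    // A panic in a spawned thread unwinds only that thread's stack…
    let handle = thread::spawn(|| {
        panic!("worker gave up");
    });

    // …and the show goes on for everyone else; the panic only resurfaces
    // as an Err when joining the thread.
    assert!(handle.join().is_err());

    // catch_unwind stops the unwinding explicitly.
    let result = panic::catch_unwind(|| {
        panic!("caught in flight");
    });
    assert!(result.is_err());
    println!("still alive");
}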
Why is this a problem?
First, panicking is non-local and invisible flow control. Mostly anything can
panic. Therefore, mostly nobody takes that into account when writing code and a
panic can lead to leaving things in an inconsistent state. This is usually
fine, because everything in that one thread dies anyway. Except when this is not
fine, like when your drop gets confused by the inconsistent state, when something
doesn’t die because it is shared or when you play around with unsafe code. Yep, be
extra careful when combining callbacks and unsafe. Like, even more extra careful than
when handling just unsafe. Because that callback can panic and then the rest of your
unsafe function won’t run. By the way, catch_unwind (or panic itself) can allocate,
and there are situations when you can’t afford to allocate (inside signal handlers,
when you play with fork, if you are the allocator itself), so there’s no reasonable
way to panic safely.
Panicking through some things is not safe. For example, through an FFI boundary. So if you pass a Rust callback into some C function (or a function that pretends to be a C function), such a callback must never panic. But, well, most anything non-trivial in Rust can panic, because panic is for handling bugs, and anything can contain bugs. So you have to somehow wrap the callback and deal with the possible panic, but it is all too easy to forget (if you do panic across an FFI boundary, Rust will abort the program, but that was not always the case).
In all these cases, you either forget and magnify the bug, or you pay the tax by making your code more complex (and run the risk of introducing more bugs because of the complexity).
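A sketch of what that wrapping might look like ‒ the C-style signature and the error-code convention here are invented for illustration:

use std::os::raw::c_int;
use std::panic;

// A callback handed over to C code. It must never unwind across the FFI
// boundary, so the possible panic is caught and turned into an error code.
extern "C" fn progress_callback(percent: c_int) -> c_int {
    let result = panic::catch_unwind(|| {
        // Anything non-trivial here can panic, because panics are for bugs
        // and any code can contain bugs.
        println!("progress: {}%", percent);
    });
    match result {
        Ok(()) => 0,
        Err(_) => -1, // report failure instead of unwinding into C
    }
}

fn main() {
    // Stand-in for the C library actually invoking the callback.
    assert_eq!(progress_callback(42), 0);
}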
Furthermore, if you start a thread to perform some work ‒ to clean up some things in
the background, to do some work and tell you when it is done through a channel, or a
bunch of threads that all synchronize with each other through a Barrier ‒ and that
thread dies with a panic, you get an application in an inconsistent state. The
application runs on, but without the thread to do the cleanups, or the work is never
finished and the main coordinating thread is never notified, or all the other threads
wait forever for the one that went belly up. And you aren’t likely to see anything in
the logs either, because unless you register your own panic hook, the panic message
goes to stderr, not to where you currently log (again, you can change that with for
example log_panics, but you must not forget ‒ and you need to know in the first place
that you must not forget).
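A rough sketch of such a hook ‒ in a real service the eprintln! would of course be replaced by whatever logging you actually use:

use std::panic;
use std::thread;

fn main() {
    // Route panic messages somewhere more useful than bare stderr.
    panic::set_hook(Box::new(|info| {
        eprintln!("[panic-hook] a thread panicked: {}", info);
    }));

    thread::spawn(|| {
        panic!("background worker died");
    })
    .join()
    .ok();
}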
This is obviously wrong ‒ having an application in a half-dead state in
production, but not getting it restarted, for who knows how long, is something a
robust application doesn’t do. You may have noticed this virtually never happens
to C or C++ programs (because C can only abort
and an unhandled exception in C++
also aborts), but very often does to Java ones. An exception backtrace scrolls
through the screen when you start it, and the application sits there not doing
what it should, in some zombie state ‒ not alive, but not dead either.
As an exercise, try to come up with a way to fix each of these (each one is the natural way to write the thing and each one is wrong around panics), preferably with only standard library primitives:
fn main() {
    thread::spawn(|| {
        loop {
            thread::sleep(Duration::from_secs(3600));
            // If this panics, we never get any cleanup done, eating
            // resources forever. Bad :-(
            do_periodic_cleanup();
        }
    });

    while keep_running() {
        let request = receive_request();
        let response = do_computation(request);
        send_response(response);
    }
}
fn main() {
    let (sender, receiver) = mpsc::channel();
    let sender_a = sender.clone();
    let sender_b = sender;
    // If either of these panics, only the other kind of item will get
    // generated from then on, and the missing ones are silently lost.
    thread::spawn(move || produce_as(sender_a));
    thread::spawn(move || produce_bs(sender_b));
    for a_or_b in receiver {
        consume(a_or_b);
    }
}
fn main() {
    let cpus = num_cpus::get();
    let barrier = Arc::new(Barrier::new(cpus));
    let mut threads = Vec::new();
    for _ in 0..cpus {
        let barrier = Arc::clone(&barrier);
        // If one thread (other than the first) panics in
        // preparation_phase or real_computation, the program
        // never finishes.
        threads.push(thread::spawn(move || {
            preparation_phase();
            barrier.wait();
            real_computation();
            barrier.wait();
            cleanup();
        }));
    }
    for thread in threads {
        // This would nicely propagate the panic… if it didn't
        // deadlock on the first thread.
        thread.join().unwrap();
    }
}
Also, you pay the price for being able to panic even though you never do. Functions that contain at least one value with a destructor (even a generated one) need to create „landing pads“ ‒ markers on the stack that are used during the unwind. These don’t come completely free, so this is against the philosophy of paying only for what you’re going to use. These costs are somewhat defensible in C++, where exceptions are commonplace, but panics should not happen in a correct, production program at all.
As a side note, I wonder if one could implement unwinding using debug info and some global lookup table of what destructors to call in a function depending on the position in the function ‒ that would probably be significantly more expensive to unwind, but the only cost when not unwinding would be including the table in the program binary.
Possible solutions
If there was a time machine that could send a message back in time, I think the
right solution would have been to implement panic
as always aborting the whole
program, full-stop. Or with a hook, but the default definitely being to abort (and
renaming it to a more descriptive bug). In general, it is a pretty
stupid idea to try to do a clean shutdown if the program is in an inconsistent
state and you have to be able to cope with unclean shutdown anyway (OOM killer,
cut power, …). All service managers handle programs that died, but you need to
have your own code to check the health of your program if it can end up in an
undead state.
What you can do now, though, is to set this in the Cargo.toml of every binary crate you write:
[profile.dev]
panic = 'abort'

[profile.release]
panic = 'abort'
This way your program will die instead of staying in an undead state, saving the costs for exorcism in your server dungeon. This doesn’t help you if you are a library author, and the standard library still places the unwind landing pads, though.
I don’t know if it would make sense, but cargo new
could actually put this
into newly created projects. Changing the default behavior for already existing
applications that didn’t specify panic strategy is probably not going to cut it
due to backwards compatibility.
Some help could be having synchronization primitives that offer a reasonable way
to cope with panicking threads. A Mutex
can get poisoned if it was locked
while panicking and anyone else touching it will then get an error (which is
commonly handled with unwrap
, propagating the panic). Furthermore, having a
thread manager that terminates with an error when any of the managed threads
dies would help, as would a channel that would include notifications about dead
senders to the receiver or a barrier that waits for as many threads as there are
handles (a dead thread would drop its handle). If anyone has too much free time
and doesn’t know how to go about it, I’m willing to mentor the work.
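For completeness, this is roughly what the Mutex poisoning mentioned above looks like in practice (a minimal sketch):

use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let shared = Arc::new(Mutex::new(0_u32));

    let clone = Arc::clone(&shared);
    let _ = thread::spawn(move || {
        let mut guard = clone.lock().unwrap();
        *guard += 1;
        panic!("died while holding the lock");
    })
    .join();

    // The mutex is now poisoned. The usual `.unwrap()` would propagate the
    // panic; handling the PoisonError still gives access to the data.
    match shared.lock() {
        Ok(guard) => println!("value: {}", guard),
        Err(poisoned) => println!("poisoned, value was {}", poisoned.into_inner()),
    }
}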
What led to this state?
I believe the biggest reason is that Rust changed its core philosophy before reaching 1.0. It started as something very similar to what Go or Erlang look like. It had green threads and panicking was supposed to be an often-used error-handling mechanism.
This paradigm was dropped (AFAIK because there was no reason to have two Gos just with different names), but some traces are still visible ‒ panics being one of them.