40 millisecond bug
This is a small story about tracking down a production bug in a Rust application. I don’t know if there’s any take away from this one for the reader, but it felt interesting so I’m sharing it.
A bit of backstory
In Avast, we have a Rust application called urlite. It serves as a backend to some other applications, provides them a HTTP API. It’s in Rust because it is latency critical. Latencies of most requests are under a millisecond.
It was written with
hyper-0.12 to deal with the HTTP. We
were quite late to update to newer versions, in part because it worked fine and
the amount of
async code was single quite short function, so we didn’t have
much motivation. And in part because we use the
spirit libraries for
configuration. It’s a library to take configuration and set up the internal
state of the program for it ‒ configuration contains the ports to listen to,
etc, and it manages spawning the HTTP server objects inside the program and can
even migrate from one set of ports to other at runtime.
spirit to newer
hyper was a big task (because
the API surface is quite large, the library does a bit unusual things compared
to all the usual applications and the change between old and new
Anyway, eventually I got permission to work on the migration of
part of my job. It took about a week to migrate both
spirit and urlite. It
went through review, went through the automatic tests and we put it to the
staging environment for a while, watching the logs and graphs. Everything seemed
fine, so after few days of everything looking fine, we pushed the button and put
it to production.
The increased latencies
As it goes in these kinds of stories, by now you’re expecting to see what went wrong.
The thing is, our own metrics and graphs were fine. But the latency on the downstream service querying us increased by 40ms. The deployment got reverted, and we started to dig into where these latencies come from.
It was acting really weird
There were several very suspicious things about that.
- Our own „internal“ latencies stayed the same. Our CPU usage also stayed the same.
- The latency graph on the downstream side was flat 40ms constant.
Now, if we introduced some performance regression in the query handling, we
would expect our CPU consumption to rise. We would also expect the latency graph
to be a bit spiky, not completely flat 40ms constant. It almost looked like
there was a 40ms
sleep somewhere. But why would anyone put a 40ms sleep
I’ve looked through documentation and didn’t see anything obvious. I’ve tried
searching both our code and code of the dependencies for
40, asked on Slack if
40ms value was familiar to anyone. Nothing.
The working theory we started with was that there could be some kind of back-off
sleep on some kind of failure. Maybe
hyper would be closing inactive
connections in the new version, forcing the application to reconnect (the graph
was for 99th percentile, so if we happened to close each 100th connection and
reconnection took this long…) and maybe try IPv6 first and we would be listening
on IPv4 only or… (in other words, we didn’t have a clear clue).
My colleague started to investigate in a more thorough way than just throwing
ideas around on Slack. He run a
wrk benchmark against the service. On his
machine, the latencies were fine. So he commandeered one of the stage nodes to
play with it and run the benchmark there. And every request had 40ms latency on
that machine. The previous version of
urlite was fine, with under one
Something was probably sleeping somewhere on the production servers, but not on the development machine. There probably was some difference in the OS settings, but definitely difference in the kernel version. The servers are running some well-tried Linux distribution, so they have a lot older version, while a developer is likely to run something much more on the edge.
The selling point of
spirit is that most of the thing one can set in the
builders of various libraries and types now can be just put into a config file
without any recompilation. If we did configuration by hand, we would expose only
the things we thought would be useful, but with spirit, there’s everything. So,
naturally, tweaking some of the config knobs was the next step (because it was
easy to do).
It was discovered that turning the
http1-writev option off, which
hyper, made the latencies go away.
We now had a solution, but I wasn’t happy about not understanding why that helped, so I went to dig the rabbit hole and find the root cause. Turns out there were several little things in just the constellation to make the problem manifest.
Overriding the defaults of
The method controls the strategy in which way data are pushed into the socket.
With vectored writes enabled, it sends two separate buffers (one with headers,
the other with the body if it’s small enough) through a single
syscall. If they are disabled,
hyper copies all the bytes into a single
buffer and sends that as a whole.
It turns out that the method takes a
bool, but there are actually 3 states.
The third (auto) is signalled by not calling the method and
calling it always, with the default to turn the
writev on. I don’t know if
leaving it on auto would make the bug go away, but I’ve fixed the problem in the
Splitting vectored writes
Spirit wants to support a bit of configuration on top of what the underlying libraries provide on their own. One of such things is limiting the number of concurrently accepted connections on a single listening socket. Users don’t have to take advantage of that (the types for the configuration can be composed together to either contain that bit or not and the administrator may leave the values for the limits unset in the configuration and then they won’t be enforced).
Anyway, in case the support for the limits is opted in through using the type
with the configuration fields, the connections themselves are wrapped in a
type. The type tries to be mostly transparent for use and can be used inside
hyper (which is what the default configuration type alias in
does), but tracks how many connections there are, to not accept more if it runs
out of the limit.
When implementing the bunch of traits for the wrapper, I’ve overlooked the
It is a provided method, which means it has a default implementation. It is the
one that abstracts the OS-level
writev, it can write multiple buffers (the
Buf represents a
The default implementation simply delegates to multiple calls to the ordinary write. Therefore, this omission combined with the default of enabling vectored writes results in calling into the kernel twice, each time with a small buffer, instead of once with two small buffers or once with a big buffer.
That was definitely something to get fixed, because if nothing else, syscalls are expensive and calling more of them is not great. But I’ve finally felt like I’m on the right path, because:
You probably know that TCP stream is built from packets going there and back. Optimizing how to split the stream into the packets and when to send them is a hard problem and the research in that area is still ongoing, because there are many conflicting requirements. One wants to deliver the data with low latency, utilize the whole bandwidth, but not overflow the capacity of the link (in which case the latencies would go up or packets would get lost and would have to be retransmitted), leave some bandwidth to other connections, etc.
The Nagle’s algorithm is one of the older tricks up in TCP’s sleeve. The network doesn’t like small packets. It is better to send as large packets as the link allows, because each packet has certain overhead ‒ the headers that take some space, but also routers spend computation power mostly based on number of packets and less on their size. If you start sending a lot of tiny packets, the performance will suffer. While the links are limited by the number of bytes that can pass through them, routers are more limited by the number of packets. If a router declares to be able to handle a gigabit connection, it actually means a gigabit if you use full-sized packets, but will get to few megabits if you split the data into tiny packets.
So it would be better to wait until the send buffer contains a packetfull of data before sending anything. But we can’t do that, because we don’t know there’ll ever be a full packet of data, or generating more data might wait for the other side to answer. We would never send anything and wait forever and the Internet would not work.
Instead, the algorithm is willing to send one undersized packet and then it
waits for an
ACK from the other side until sending another undersized one. If
it gets a packetfull of data to send in the meantime, that’s great and it sends
it (it won’t get better by waiting longer), but it won’t ever have two
undersized ones somewhere in flight, therefore won’t kill the network’s
performance by them.
This works, it’ll make progress eventually because once the
ACK comes and
all the data that accumulated until then is sent.
But it also slows things down. Like in our case. What happens if we do it using single syscall, the whole HTTP response forms a single undersized packet (we have really small answers) and gets sent, no matter if it’s submitted to the kernel by one big or two small buffers.
On the other hand, if we split the submission into two syscalls, this is what happens:
- We write the first part (headers). The kernel sends them out as an undersized packet.
- We write the second part (the body). But the kernel shelves them into the send
buffer and waits for sending them until it sees the
ACK, because there’s one undersized packet somewhere in flight already. Here is our sleep.
This theory got confirmed by toggling the
tcp-nodelay option on (and leaving
writev on too, without the fix in the library applied), which also solved our
problem. This option turns the Nagle’s algorithm off and just sends all data out
as soon as possible (considering other things like buffers in the network card,
TCP windows, congestion algorithms…).
Ok, but why 40ms?
But wait, the other service lives on the same machine. It really can’t take that
long for the packet to pass through the loopback interface and an
return back. Hell, it should not take 40ms if it was a different datacenter on
the same continent. Not speaking about this taking exactly 40ms. Even if we
had really long loopback, it would not turn out to be 40ms exactly each time.
So there must be something more happening.
And indeed, there is. It’s called delayed ACKs and it’s also an optimisation of TCP. One that got introduced about the same time as Nagle’s algorithm, but without knowing about each other or considering how they would interact.
You see, one doesn’t want to send small packets (deja-vu?). A packet carrying
ACK and no data is certainly small and we don’t want to send it if we
can help that.
ACK can be used to acknowledge multiple received packets at once and it
can piggy-back on a packet with actual data in it, the receiver wants to wait a
little bit. The idea is, when data is received, it often provokes the
application to respond to it with some other data so we wait for that to happen,
then send the data with the
ACK. If not, we at last wait for a second
packet, to acknowledge two at once. Only if we don’t get any response data nor
more incoming packets we decide to send that
ACK eventually. After all, the
sender doesn’t really need that
ACK fast, does it, because if it would have
more data to send, it would send at least the whole TCP window, which is several
Except that the other side might be waiting for it to send more data because of the Nagle’s algorithm. That sucks.
And indeed, a little bit of searching through the Internet actually finds a kernel patch talking about delayed ACKs and the value of 40ms. Bingo, this can’t be coincidence.
I have reasons to believe this interaction got somehow solved in newer kernels (I don’t know how, but apparently, it no longer happens on our development machines). And that 40ms timeout is something the client side chooses.
Well, sometimes the situation happens to be a really complex mess that results in some weird interaction (and make a good material for a blog post). The above combination of the TCP optimisations is described on the Internet, but I’d say it’s not widely known one. That getting triggered through defaulting to a less optimal but generally good enough implementation is even bigger chance. And that such a small detail can make your optimized service 40 times slower.
Nevertheless, our takeaway, apart from fixing the discovered issues is:
- We started to watch not only our own graphs, but graphs of other services around and have alerts for them. Our own graphs just can’t contain delays introduced by packets sleeping somewhere in the kernel.
- We now run with
tcp-nodelayenabled. Even when the bug is now fixed, the interaction is still technically possible to hit in case the client uses pipelining ‒ the client could send two HTTP requests in a row without waiting for the first response in between, possibly in the same TCP packet. Then we would have written the first response and hit the problem after writing the second one.
Obviously, Rust doesn’t catch all the bugs, this one have slipped through the
compiler 😈. And while we have discovered the cause reasonably fast, it also
required a bit of luck (the toggling of the
writev option was just random
playing around, not really something we had reasons to believe would help).