40 millisecond bug
This is a small story about tracking down a production bug in a Rust application. I don’t know if there’s any takeaway in it for the reader, but it felt interesting, so I’m sharing it.
A bit of backstory
At Avast, we have a Rust application called urlite. It serves as a backend to some other applications and provides them with an HTTP API. It’s in Rust because it is latency critical: latencies of most requests are under a millisecond.
It was written with tokio-0.1 and hyper-0.12 to deal with the HTTP. We were quite late to update to newer versions, in part because it worked fine and the amount of async code was a single, quite short function, so we didn’t have much motivation. And in part because we use the spirit libraries for configuration. It’s a library that takes the configuration and sets up the internal state of the program for it ‒ the configuration contains the ports to listen on, etc, and the library manages spawning the HTTP server objects inside the program and can even migrate from one set of ports to another at runtime.
But migrating spirit to newer tokio and hyper was a big task (the API surface is quite large, the library does somewhat unusual things compared to ordinary applications, and the change between the old and new tokio was substantial).
Anyway, eventually I got permission to work on the migration of spirit as part of my job. It took about a week to migrate both spirit and urlite. It went through review, passed the automatic tests, and we put it into the staging environment for a while, watching the logs and graphs. After a few days of everything looking fine, we pushed the button and put it into production.
The increased latencies
As it goes in these kinds of stories, by now you’re expecting to see what went wrong.
The thing is, our own metrics and graphs were fine. But the latency on the downstream service querying us increased by 40ms. The deployment got reverted, and we started to dig into where these latencies came from.
It was acting really weird
There were several very suspicious things about that.
- Our own „internal“ latencies stayed the same. Our CPU usage also stayed the same.
- The latency graph on the downstream side was a flat, constant 40ms.
Now, if we had introduced some performance regression in the query handling, we would expect our CPU consumption to rise. We would also expect the latency graph to be a bit spiky, not a completely flat 40ms constant. It almost looked like there was a 40ms sleep somewhere. But why would anyone put a 40ms sleep anywhere?
I looked through the documentation and didn’t see anything obvious. I tried searching both our code and the code of the dependencies for 40, and asked on Slack if that 40ms value was familiar to anyone. Nothing.
The working theory we started with was that there could be some kind of back-off sleep on some kind of failure. Maybe hyper would be closing inactive connections in the new version, forcing the application to reconnect (the graph was for the 99th percentile, so if we happened to close every 100th connection and reconnecting took this long…), or maybe it would try IPv6 first while we were listening on IPv4 only, or… (in other words, we didn’t have a clear clue).
The benchmarks
My colleague started to investigate in a more thorough way than just throwing ideas around on Slack. He ran a wrk benchmark against the service. On his machine, the latencies were fine. So he commandeered one of the stage nodes to play with and ran the benchmark there. And every request had a 40ms latency on that machine. The previous version of urlite was fine, at under one millisecond.
Something was probably sleeping somewhere on the production servers, but not on the development machine. There was probably some difference in the OS settings, and definitely a difference in the kernel version. The servers run a well-tried Linux distribution, so they have a much older kernel, while a developer is likely to run something much closer to the edge.
Configuration options
The selling point of spirit is that most of the things one can set in the builders of various libraries and types can now simply be put into a config file, without any recompilation. If we did the configuration by hand, we would expose only the things we thought would be useful, but with spirit, everything is there. So, naturally, tweaking some of the config knobs was the next step (because it was easy to do).
It was discovered that turning the http1-writev option off, which corresponds to this method in hyper, made the latencies go away.
We now had a solution, but I wasn’t happy about not understanding why it helped, so I went digging down the rabbit hole to find the root cause. It turns out there were several little things in just the right constellation to make the problem manifest.
Overriding the defaults of http1_writev
The method controls the strategy by which data is pushed into the socket. With vectored writes enabled, hyper sends two separate buffers (one with the headers, the other with the body, if it’s small enough) through a single writev syscall. With them disabled, hyper copies all the bytes into a single buffer and sends that as a whole.
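To make the difference concrete, here is roughly what the two strategies look like at the syscall level, sketched with std’s plain and vectored write APIs (hyper’s real implementation is more involved; the function names here are made up for illustration):

```rust
use std::io::{self, IoSlice, Write};
use std::net::TcpStream;

// Vectored strategy: headers and body go to the kernel in one writev call,
// each staying in its own buffer.
fn send_vectored(stream: &mut TcpStream, headers: &[u8], body: &[u8]) -> io::Result<usize> {
    stream.write_vectored(&[IoSlice::new(headers), IoSlice::new(body)])
}

// Flattened strategy: everything is first copied into a single buffer and
// then submitted through one plain write.
fn send_flattened(stream: &mut TcpStream, headers: &[u8], body: &[u8]) -> io::Result<usize> {
    let mut buffer = Vec::with_capacity(headers.len() + body.len());
    buffer.extend_from_slice(headers);
    buffer.extend_from_slice(body);
    stream.write(&buffer)
}
```

Both variants hand the headers and the body to the kernel in a single syscall; they differ only in whether the bytes get copied first.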
It turns out that the method takes a bool, but there are actually three states. The third (auto) is signalled by not calling the method at all, and spirit-hyper was always calling it, with the default set to turn writev on. I don’t know if leaving it on auto would have made the bug go away, but I’ve fixed the problem in the library anyway.
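The fix boils down to only touching the builder when the option is explicitly configured. A minimal sketch of the idea, assuming a hyper 0.13/0.14-style server Builder and a hypothetical Option<bool> configuration value (the real spirit-hyper code is structured differently):

```rust
use hyper::server::conn::AddrIncoming;
use hyper::server::Builder;

/// Hypothetical helper: apply a three-state `writev` setting from the config.
/// `None` means "auto" and is expressed by *not* calling the builder method.
fn apply_writev(builder: Builder<AddrIncoming>, writev: Option<bool>) -> Builder<AddrIncoming> {
    match writev {
        // Leave hyper's auto-detection in place.
        None => builder,
        // Force vectored writes on or off explicitly.
        Some(enabled) => builder.http1_writev(enabled),
    }
}
```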
Splitting vectored writes
Spirit wants to support a bit of configuration on top of what the underlying libraries provide on their own. One such thing is limiting the number of concurrently accepted connections on a single listening socket. Users don’t have to take advantage of that (the configuration types can be composed to either contain that bit or not, and the administrator may leave the limit values unset in the configuration, in which case they won’t be enforced).
Anyway, in case the support for the limits is opted into by using the type with the configuration fields, the connections themselves are wrapped in a Tracked type. The type tries to be mostly transparent in use and can be used inside hyper (which is what the default configuration type alias in spirit-hyper does), but it tracks how many connections there are, so it doesn’t accept more once it runs out of the limit.
When implementing the bunch of traits for the wrapper, I overlooked AsyncWrite::poll_write_buf. It is a provided method, which means it has a default implementation. It is the one that abstracts the OS-level writev; it can write multiple buffers at once (the Buf represents a segmented buffer).
The default implementation simply delegates to multiple calls of the ordinary write. Therefore, this omission, combined with the default of enabling vectored writes, results in calling into the kernel twice, each time with a small buffer, instead of once with two small buffers or once with one big buffer.
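For illustration, here is a stripped-down sketch of such a wrapper and the missing override, assuming tokio 0.2’s AsyncWrite and bytes 0.5 (the real Tracked type in spirit does more than this, and the names below are simplified):

```rust
use std::io;
use std::pin::Pin;
use std::task::{Context, Poll};

use bytes::Buf;
use tokio::io::AsyncWrite;

// Hypothetical, stripped-down stand-in for the real `Tracked` wrapper.
struct Tracked<S> {
    inner: S,
    // ...connection-counting bookkeeping lives here in the real type
}

impl<S: AsyncWrite + Unpin> AsyncWrite for Tracked<S> {
    fn poll_write(
        mut self: Pin<&mut Self>,
        cx: &mut Context<'_>,
        buf: &[u8],
    ) -> Poll<io::Result<usize>> {
        Pin::new(&mut self.inner).poll_write(cx, buf)
    }

    fn poll_flush(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<io::Result<()>> {
        Pin::new(&mut self.inner).poll_flush(cx)
    }

    fn poll_shutdown(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<io::Result<()>> {
        Pin::new(&mut self.inner).poll_shutdown(cx)
    }

    // The provided method that was overlooked. Without this override, the
    // default implementation falls back to plain `poll_write`, so a segmented
    // buffer is pushed to the kernel piece by piece instead of in one
    // vectored write.
    fn poll_write_buf<B: Buf>(
        mut self: Pin<&mut Self>,
        cx: &mut Context<'_>,
        buf: &mut B,
    ) -> Poll<io::Result<usize>>
    where
        Self: Sized,
    {
        Pin::new(&mut self.inner).poll_write_buf(cx, buf)
    }
}
```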
That was definitely something to get fixed, because if nothing else, syscalls are expensive and making more of them than necessary is not great. But I finally felt like I was on the right path, because:
Nagle’s algorithm
You probably know that a TCP stream is built from packets going back and forth. Optimizing how to split the stream into packets and when to send them is a hard problem, and research in that area is still ongoing, because there are many conflicting requirements. One wants to deliver the data with low latency and utilize the whole bandwidth, but not overflow the capacity of the link (in which case latencies would go up, or packets would get lost and have to be retransmitted), leave some bandwidth to other connections, etc.
Nagle’s algorithm is one of the older tricks up TCP’s sleeve. The network doesn’t like small packets. It is better to send packets as large as the link allows, because each packet has a certain overhead ‒ the headers take some space, but routers also spend computation power mostly based on the number of packets and less on their size. If you start sending a lot of tiny packets, performance will suffer. While links are limited by the number of bytes that can pass through them, routers are more limited by the number of packets. If a router claims to handle a gigabit connection, it actually means a gigabit if you use full-sized packets, but it will drop to a few megabits if you split the data into tiny packets.
So it would be better to wait until the send buffer contains a packetful of data before sending anything. But we can’t do that, because we don’t know there’ll ever be a full packet of data, or generating more data might require waiting for the other side to answer. We would never send anything, wait forever, and the Internet would not work.
Instead, the algorithm is willing to send one undersized packet and then waits for an ACK from the other side before sending another undersized one. If it gets a packetful of data to send in the meantime, that’s great and it sends it (it won’t get better by waiting longer), but it won’t ever have two undersized packets in flight at once, and therefore won’t kill the network’s performance with them.
This works and it’ll make progress eventually, because once the ACK comes, all the data that accumulated until then is sent.
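In pseudocode, the sender-side rule looks roughly like this (a heavily simplified sketch, not the actual kernel logic):

```rust
/// A very rough sketch of the sender-side rule in Nagle's algorithm.
/// The real kernel code has many more conditions (PSH flags, corking, ...).
fn nagle_may_send_now(buffered_bytes: usize, mss: usize, small_segment_in_flight: bool) -> bool {
    if buffered_bytes >= mss {
        // A full-sized segment never waits; sending it cannot get any better.
        true
    } else {
        // An undersized segment may only go out if no other undersized
        // segment is still waiting for its ACK.
        !small_segment_in_flight
    }
}
```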
But it also slows things down. Like in our case. If we do the write using a single syscall, the whole HTTP response forms a single undersized packet (we have really small answers) and gets sent, no matter whether it’s submitted to the kernel as one big buffer or two small ones.
On the other hand, if we split the submission into two syscalls, this is what happens:
- We write the first part (headers). The kernel sends them out as an undersized packet.
- We write the second part (the body). But the kernel shelves it in the send buffer and waits to send it until it sees the ACK, because there’s already one undersized packet in flight. Here is our sleep.
This theory got confirmed by toggling the tcp-nodelay option on (while leaving writev on too, without the fix in the library applied), which also solved the problem. This option turns Nagle’s algorithm off and just sends all data out as soon as possible (subject to other things like buffers in the network card, TCP windows, congestion algorithms…).
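Under the hood, that option boils down to setting TCP_NODELAY on the accepted sockets. In plain std Rust, the equivalent looks roughly like this (a sketch; spirit applies it through its own configuration machinery):

```rust
use std::io;
use std::net::TcpListener;

fn serve(addr: &str) -> io::Result<()> {
    let listener = TcpListener::bind(addr)?;
    loop {
        let (stream, _peer) = listener.accept()?;
        // TCP_NODELAY: hand data to the network as soon as possible instead of
        // waiting for an ACK to coalesce undersized packets (Nagle's algorithm).
        stream.set_nodelay(true)?;
        // ...hand the connection over to the HTTP machinery here
    }
}
```

The trade-off is more, smaller packets on the wire, which for us is well worth avoiding a 40ms stall.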
Ok, but why 40ms?
But wait, the other service lives on the same machine. It really can’t take that long for the packet to pass through the loopback interface and an ACK to return. Hell, it should not take 40ms even if it were a different datacenter on the same continent. Not to mention it taking exactly 40ms: even if we had a really slow loopback, it would not come out to exactly 40ms every time. So there must be something more going on.
And indeed, there is. It’s called delayed ACKs and it’s also a TCP optimisation, one that was introduced at about the same time as Nagle’s algorithm, but without either knowing about the other or considering how they would interact.
You see, one doesn’t want to send small packets (déjà vu?). A packet carrying only an ACK and no data is certainly small, and we don’t want to send it if we can help it.
As an ACK can acknowledge multiple received packets at once and can piggy-back on a packet with actual data in it, the receiver wants to wait a little bit. The idea is that received data often provokes the application into responding with some other data, so we wait for that to happen and then send the ACK along with that data. If not, we at least wait for a second packet, to acknowledge two at once. Only if we get neither response data nor more incoming packets do we eventually decide to send a bare ACK. After all, the sender doesn’t really need that ACK fast, does it? If it had more data to send, it would send at least a whole TCP window, which is several packets long.
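Put as a (heavily simplified, hypothetical) decision function, the receiver’s side looks something like this:

```rust
/// A heavily simplified sketch of the receiver-side delayed-ACK decision.
/// Real TCP stacks have many more rules (quick-ack mode, window updates, ...).
fn when_to_ack(reply_ready: bool, unacked_segments: u32) -> &'static str {
    if reply_ready {
        // The application produced a response: piggy-back the ACK on it.
        "now, attached to the outgoing data"
    } else if unacked_segments >= 2 {
        // A second segment arrived: acknowledge both with one packet.
        "now, one ACK covering both segments"
    } else {
        // Nothing to piggy-back on and only one segment pending:
        // wait for the delayed-ACK timer (the 40ms in this story).
        "after the delayed-ACK timeout fires"
    }
}
```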
Except that the other side might be waiting for it to send more data because of Nagle’s algorithm. That sucks.
And indeed, a little bit of searching through the Internet turns up a kernel patch talking about delayed ACKs and the value of 40ms. Bingo, this can’t be a coincidence.
I have reasons to believe this interaction got somehow solved in newer kernels (I don’t know how, but apparently, it no longer happens on our development machines). And that 40ms timeout is something the client side chooses.
Conclusion
Well, sometimes the situation happens to be a really complex mess that results in some weird interaction (and makes good material for a blog post). The above combination of TCP optimisations is described on the Internet, but I’d say it’s not a widely known one. That it got triggered through defaulting to a less optimal but generally good-enough implementation is an even bigger coincidence. And such a small detail can make your optimized service 40 times slower.
Nevertheless, our takeaways, apart from fixing the discovered issues, are:
- We started to watch not only our own graphs, but also the graphs of the other services around us, and to have alerts on them. Our own graphs simply can’t contain delays introduced by packets sleeping somewhere in the kernel.
- We now run with tcp-nodelay enabled. Even though the bug is now fixed, the interaction is still technically possible to hit in case the client uses pipelining ‒ the client could send two HTTP requests in a row without waiting for the first response in between, possibly in the same TCP packet. Then we would have written the first response and hit the problem after writing the second one.
Obviously, Rust doesn’t catch all the bugs; this one slipped through the compiler 😈. And while we discovered the cause reasonably fast, it also required a bit of luck (toggling the writev option was just random playing around, not something we had a reason to believe would help).