40 millisecond bug
This is a small story about tracking down a production bug in a Rust application. I don’t know if there’s any takeaway in it for the reader, but it felt interesting, so I’m sharing it.
A bit of backstory
At Avast, we have a Rust application called urlite. It serves as a backend to some other applications and provides them with an HTTP API. It’s in Rust because it is latency critical: latencies of most requests are under a millisecond.
It was written with tokio-0.1 and hyper-0.12 to deal with the HTTP. We were quite late to update to newer versions, in part because it worked fine and the amount of async code was a single, quite short function, so we didn’t have much motivation. And in part because we use the spirit libraries for configuration. It’s a library that takes the configuration and sets up the internal state of the program for it ‒ the configuration contains the ports to listen on, etc, and the library manages spawning the HTTP server objects inside the program and can even migrate from one set of ports to another at runtime.
But migrating spirit to newer tokio and hyper was a big task (the API surface is quite large, the library does somewhat unusual things compared to ordinary applications, and the change between the old and new tokio was substantial).
Anyway, eventually I got permission to work on the migration of spirit as part of my job. It took about a week to migrate both spirit and urlite. It went through review, passed the automatic tests, and we put it into the staging environment for a while, watching the logs and graphs. After a few days of everything looking fine, we pushed the button and put it into production.
The increased latencies
As it goes in these kinds of stories, by now you’re expecting to see what went wrong.
The thing is, our own metrics and graphs were fine. But the latency on the downstream service querying us increased by 40ms. The deployment got reverted, and we started to dig into where these latencies came from.
It was acting really weird
There were several very suspicious things about that.
- Our own „internal“ latencies stayed the same. Our CPU usage also stayed the same.
- The latency graph on the downstream side was a flat, constant 40ms.
Now, if we had introduced some performance regression in the query handling, we would expect our CPU consumption to rise. We would also expect the latency graph to be a bit spiky, not a completely flat 40ms constant. It almost looked like there was a 40ms sleep somewhere. But why would anyone put a 40ms sleep anywhere?
I looked through the documentation and didn’t see anything obvious. I tried searching both our code and the code of the dependencies for 40, and asked on Slack if that 40ms value was familiar to anyone. Nothing.
The working theory we started with was that there could be some kind of back-off sleep on some kind of failure. Maybe hyper would be closing inactive connections in the new version, forcing the application to reconnect (the graph was for the 99th percentile, so if we happened to close every 100th connection and reconnecting took this long…), or maybe it would try IPv6 first while we were listening on IPv4 only, or… (in other words, we didn’t have a clear clue).
The benchmarks
My colleague started to investigate in a more thorough way than just throwing ideas around on Slack. He ran a wrk benchmark against the service. On his machine, the latencies were fine. So he commandeered one of the stage nodes to play with and ran the benchmark there. And every request had a 40ms latency on that machine. The previous version of urlite was fine, at under one millisecond.
Something was probably sleeping somewhere on the production servers, but not on the development machine. There was probably some difference in the OS settings, and definitely a difference in the kernel version. The servers run a well-tried Linux distribution, so they have a much older kernel, while a developer is likely to run something much closer to the edge.
Configuration options
The selling point of spirit is that most of the things one can set in the builders of various libraries and types can now simply be put into a config file, without any recompilation. If we did the configuration by hand, we would expose only the things we thought would be useful, but with spirit, everything is there. So, naturally, tweaking some of the config knobs was the next step (because it was easy to do).
It was discovered that turning the http1-writev option off, which corresponds to this method in hyper, made the latencies go away.
We now had a solution, but I wasn’t happy about not understanding why it helped, so I went digging down the rabbit hole to find the root cause. It turns out there were several little things in just the right constellation to make the problem manifest.
Overriding the defaults of http1_writev
The method controls the strategy by which data is pushed into the socket. With vectored writes enabled, hyper sends two separate buffers (one with the headers, the other with the body, if it’s small enough) through a single writev syscall. With them disabled, hyper copies all the bytes into a single buffer and sends that as a whole.
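To make the difference concrete, here is roughly what the two strategies look like at the syscall level, sketched with std’s plain and vectored write APIs (hyper’s real implementation is more involved; the function names here are made up for illustration):

```rust
use std::io::{self, IoSlice, Write};
use std::net::TcpStream;

// Vectored strategy: headers and body go to the kernel in one writev call,
// each staying in its own buffer.
fn send_vectored(stream: &mut TcpStream, headers: &[u8], body: &[u8]) -> io::Result<usize> {
    stream.write_vectored(&[IoSlice::new(headers), IoSlice::new(body)])
}

// Flattened strategy: everything is first copied into a single buffer and
// then submitted through one plain write.
fn send_flattened(stream: &mut TcpStream, headers: &[u8], body: &[u8]) -> io::Result<usize> {
    let mut buffer = Vec::with_capacity(headers.len() + body.len());
    buffer.extend_from_slice(headers);
    buffer.extend_from_slice(body);
    stream.write(&buffer)
}
```

Both variants hand the headers and the body to the kernel in a single syscall; they differ only in whether the bytes get copied first.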
It turns out that the method takes a bool, but there are actually three states. The third (auto) is signalled by not calling the method at all, and spirit-hyper was always calling it, with the default set to turn writev on. I don’t know if leaving it on auto would have made the bug go away, but I’ve fixed the problem in the library anyway.
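The fix boils down to only touching the builder when the option is explicitly configured. A minimal sketch of the idea, assuming a hyper 0.13/0.14-style server Builder and a hypothetical Option<bool> configuration value (the real spirit-hyper code is structured differently):

```rust
use hyper::server::conn::AddrIncoming;
use hyper::server::Builder;

/// Hypothetical helper: apply a three-state `writev` setting from the config.
/// `None` means "auto" and is expressed by *not* calling the builder method.
fn apply_writev(builder: Builder<AddrIncoming>, writev: Option<bool>) -> Builder<AddrIncoming> {
    match writev {
        // Leave hyper's auto-detection in place.
        None => builder,
        // Force vectored writes on or off explicitly.
        Some(enabled) => builder.http1_writev(enabled),
    }
}
```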
Splitting vectored writes
Spirit wants to support a bit of configuration on top of what the underlying libraries provide on their own. One such thing is limiting the number of concurrently accepted connections on a single listening socket. Users don’t have to take advantage of that (the configuration types can be composed to either contain that bit or not, and the administrator may leave the limit values unset in the configuration, in which case they won’t be enforced).
Anyway, in case the support for the limits is opted into by using the type with the configuration fields, the connections themselves are wrapped in a Tracked type. The type tries to be mostly transparent in use and can be used inside hyper (which is what the default configuration type alias in spirit-hyper does), but it tracks how many connections there are, so it doesn’t accept more once it runs out of the limit.
When implementing the bunch of traits for the wrapper, I overlooked AsyncWrite::poll_write_buf. It is a provided method, which means it has a default implementation. It is the one that abstracts the OS-level writev; it can write multiple buffers at once (the Buf represents a segmented buffer).
The default implementation simply delegates to multiple calls of the ordinary write. Therefore, this omission, combined with the default of enabling vectored writes, results in calling into the kernel twice, each time with a small buffer, instead of once with two small buffers or once with one big buffer.
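For illustration, here is a stripped-down sketch of such a wrapper and the missing override, assuming tokio 0.2’s AsyncWrite and bytes 0.5 (the real Tracked type in spirit does more than this, and the names below are simplified):

```rust
use std::io;
use std::pin::Pin;
use std::task::{Context, Poll};

use bytes::Buf;
use tokio::io::AsyncWrite;

// Hypothetical, stripped-down stand-in for the real `Tracked` wrapper.
struct Tracked<S> {
    inner: S,
    // ...connection-counting bookkeeping lives here in the real type
}

impl<S: AsyncWrite + Unpin> AsyncWrite for Tracked<S> {
    fn poll_write(
        mut self: Pin<&mut Self>,
        cx: &mut Context<'_>,
        buf: &[u8],
    ) -> Poll<io::Result<usize>> {
        Pin::new(&mut self.inner).poll_write(cx, buf)
    }

    fn poll_flush(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<io::Result<()>> {
        Pin::new(&mut self.inner).poll_flush(cx)
    }

    fn poll_shutdown(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<io::Result<()>> {
        Pin::new(&mut self.inner).poll_shutdown(cx)
    }

    // The provided method that was overlooked. Without this override, the
    // default implementation falls back to plain `poll_write`, so a segmented
    // buffer is pushed to the kernel piece by piece instead of in one
    // vectored write.
    fn poll_write_buf<B: Buf>(
        mut self: Pin<&mut Self>,
        cx: &mut Context<'_>,
        buf: &mut B,
    ) -> Poll<io::Result<usize>>
    where
        Self: Sized,
    {
        Pin::new(&mut self.inner).poll_write_buf(cx, buf)
    }
}
```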
That was definitely something to get fixed, because if nothing else, syscalls are expensive and making more of them than necessary is not great. But I finally felt like I was on the right path, because:
Nagle’s algorithm
You probably know that a TCP stream is built from packets going back and forth. Optimizing how to split the stream into packets and when to send them is a hard problem, and research in that area is still ongoing, because there are many conflicting requirements. One wants to deliver the data with low latency and utilize the whole bandwidth, but not overflow the capacity of the link (in which case latencies would go up, or packets would get lost and have to be retransmitted), leave some bandwidth to other connections, etc.
Nagle’s algorithm is one of the older tricks up TCP’s sleeve. The network doesn’t like small packets. It is better to send packets as large as the link allows, because each packet has a certain overhead ‒ the headers take some space, but routers also spend computation power mostly based on the number of packets and less on their size. If you start sending a lot of tiny packets, performance will suffer. While links are limited by the number of bytes that can pass through them, routers are more limited by the number of packets. If a router claims to handle a gigabit connection, it actually means a gigabit if you use full-sized packets, but it will drop to a few megabits if you split the data into tiny packets.
So it would be better to wait until the send buffer contains a packetful of data before sending anything. But we can’t do that, because we don’t know there’ll ever be a full packet of data, or generating more data might require waiting for the other side to answer. We would never send anything, wait forever, and the Internet would not work.
Instead, the algorithm is willing to send one undersized packet and then waits for an ACK from the other side before sending another undersized one. If it gets a packetful of data to send in the meantime, that’s great and it sends it (it won’t get better by waiting longer), but it won’t ever have two undersized packets in flight at once, and therefore won’t kill the network’s performance with them.
This works and it’ll make progress eventually, because once the ACK comes, all the data that accumulated until then is sent.
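In pseudocode, the sender-side rule looks roughly like this (a heavily simplified sketch, not the actual kernel logic):

```rust
/// A very rough sketch of the sender-side rule in Nagle's algorithm.
/// The real kernel code has many more conditions (PSH flags, corking, ...).
fn nagle_may_send_now(buffered_bytes: usize, mss: usize, small_segment_in_flight: bool) -> bool {
    if buffered_bytes >= mss {
        // A full-sized segment never waits; sending it cannot get any better.
        true
    } else {
        // An undersized segment may only go out if no other undersized
        // segment is still waiting for its ACK.
        !small_segment_in_flight
    }
}
```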
But it also slows things down. Like in our case. If we do the write using a single syscall, the whole HTTP response forms a single undersized packet (we have really small answers) and gets sent, no matter whether it’s submitted to the kernel as one big buffer or two small ones.
On the other hand, if we split the submission into two syscalls, this is what happens:
- We write the first part (headers). The kernel sends them out as an undersized packet.
- We write the second part (the body). But the kernel shelves it in the send buffer and waits to send it until it sees the ACK, because there’s already one undersized packet in flight. Here is our sleep.
This theory got confirmed by toggling the tcp-nodelay option on (while leaving writev on too, without the fix in the library applied), which also solved the problem. This option turns Nagle’s algorithm off and just sends all data out as soon as possible (subject to other things like buffers in the network card, TCP windows, congestion algorithms…).
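Under the hood, that option boils down to setting TCP_NODELAY on the accepted sockets. In plain std Rust, the equivalent looks roughly like this (a sketch; spirit applies it through its own configuration machinery):

```rust
use std::io;
use std::net::TcpListener;

fn serve(addr: &str) -> io::Result<()> {
    let listener = TcpListener::bind(addr)?;
    loop {
        let (stream, _peer) = listener.accept()?;
        // TCP_NODELAY: hand data to the network as soon as possible instead of
        // waiting for an ACK to coalesce undersized packets (Nagle's algorithm).
        stream.set_nodelay(true)?;
        // ...hand the connection over to the HTTP machinery here
    }
}
```

The trade-off is more, smaller packets on the wire, which for us is well worth avoiding a 40ms stall.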
Ok, but why 40ms?
But wait, the other service lives on the same machine. It really can’t take that long for the packet to pass through the loopback interface and an ACK to return. Hell, it should not take 40ms even if it were a different datacenter on the same continent. Not to mention it taking exactly 40ms: even if we had a really slow loopback, it would not come out to exactly 40ms every time. So there must be something more going on.
And indeed, there is. It’s called delayed ACKs and it’s also a TCP optimisation, one that was introduced at about the same time as Nagle’s algorithm, but without either knowing about the other or considering how they would interact.
You see, one doesn’t want to send small packets (déjà vu?). A packet carrying only an ACK and no data is certainly small, and we don’t want to send it if we can help it.
As an ACK can acknowledge multiple received packets at once and can piggy-back on a packet with actual data in it, the receiver wants to wait a little bit. The idea is that received data often provokes the application into responding with some other data, so we wait for that to happen and then send the ACK along with that data. If not, we at least wait for a second packet, to acknowledge two at once. Only if we get neither response data nor more incoming packets do we eventually decide to send a bare ACK. After all, the sender doesn’t really need that ACK fast, does it? If it had more data to send, it would send at least a whole TCP window, which is several packets long.
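Put as a (heavily simplified, hypothetical) decision function, the receiver’s side looks something like this:

```rust
/// A heavily simplified sketch of the receiver-side delayed-ACK decision.
/// Real TCP stacks have many more rules (quick-ack mode, window updates, ...).
fn when_to_ack(reply_ready: bool, unacked_segments: u32) -> &'static str {
    if reply_ready {
        // The application produced a response: piggy-back the ACK on it.
        "now, attached to the outgoing data"
    } else if unacked_segments >= 2 {
        // A second segment arrived: acknowledge both with one packet.
        "now, one ACK covering both segments"
    } else {
        // Nothing to piggy-back on and only one segment pending:
        // wait for the delayed-ACK timer (the 40ms in this story).
        "after the delayed-ACK timeout fires"
    }
}
```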
Except that the other side might be waiting for it to send more data because of Nagle’s algorithm. That sucks.
And indeed, a little bit of searching through the Internet turns up a kernel patch talking about delayed ACKs and the value of 40ms. Bingo, this can’t be a coincidence.
I have reasons to believe this interaction got somehow solved in newer kernels (I don’t know how, but apparently, it no longer happens on our development machines). And that 40ms timeout is something the client side chooses.
Conclusion
Well, sometimes the situation happens to be a really complex mess that results in some weird interaction (and makes good material for a blog post). The above combination of TCP optimisations is described on the Internet, but I’d say it’s not a widely known one. That it got triggered through defaulting to a less optimal but generally good-enough implementation is an even bigger coincidence. And such a small detail can make your optimized service 40 times slower.
Nevertheless, our takeaways, apart from fixing the discovered issues, are:
- We started to watch not only our own graphs, but also the graphs of the other services around us, and to have alerts on them. Our own graphs simply can’t contain delays introduced by packets sleeping somewhere in the kernel.
- We now run with tcp-nodelay enabled. Even though the bug is now fixed, the interaction is still technically possible to hit in case the client uses pipelining ‒ the client could send two HTTP requests in a row without waiting for the first response in between, possibly in the same TCP packet. Then we would have written the first response and hit the problem after writing the second one.
Obviously, Rust doesn’t catch all the bugs; this one slipped through the compiler 😈. And while we discovered the cause reasonably fast, it also required a bit of luck (toggling the writev option was just random playing around, not something we had a reason to believe would help).