Saving some allocations
I guess most of you know the feeling that some code is not completely optimal. It probably doesn’t matter much in practice, because there are more expensive things around somewhere else, but the feeling when looking at it and thinking „well, there must be some better way“. And then you go on a war with the code to come up with something that is hopefully still readable, but somewhat better in whatever metric you’re obsessing with at the time.
I often declare these kinds of wars on unneeded copies and allocations. Talking to the global allocator is not exactly cheap and making extra copies is just not elegant. Sometimes it means using a bump allocator, or even writing an extension for it. Sometimes it means extensive use of borrowing. Now I have a long-standing battle with few of these allocations in a service I work on in my day job. There seems to be a possibility this one will finally get resolved and here goes a description how.
What the code does
The service accepts requests over HTTP. The POST body is protobuf encoded.
The HTTP part is taken care of by hyper. For the protobuf, I’ve chosen prost, mostly because it makes the generated API look more like Rust than Java and because it Just Works without any hassle around installing plugins. And it seems popular too, which means it hopefully won’t fall into disrepair any time soon.
The code currently looks something like this (I’m replacing the specifics with a phone-book example and simplifying):
message Entry {
optional string name = 1;
optional string phone = 2;
optional string address = 3;
}
message Batch {
repeated Entry entries = 1;
}
async fn phonebook_submit(req: Request<Body>, limits: &Limits)
-> Result<Response<Body>, Error>
{
// Accumulate the whole body. But protect from evil or broken clients that
// send too much stuff, we don't want to eat all the RAM.
let mut data = Vec::new();
let mut body = req.into_body();
while let Some(chunk) = body.next().await {
let chunk = chunk?;
if data.len() + chunk.len() > limits.max_bytes {
return Err(TooLarge.into());
}
data.extend(chunk);
}
let decoded = Batch::decode(&data)?;
if decoded.etries.len() > limits.max_entries {
return Err(TooLarge.into());
}
process_submit(&decoded)?;
Ok(Response::builder().status(StatusCode::NO_CONTENT))
}
This code works. But I have this itching that it allocates and copies data around needlessly. Let’s look where.
Collecting the whole HTTP body
First it accumulates the whole body into a Vec
, because prost doesn’t allow
parsing incrementally (it can merge
, but this won’t handle a field spanning
across the buffer boundary). So every time we receive a request, new Vec
is
allocated and the data are copied into it. This allocates at least one, but may
reallocate to accommodate longer multi-chunk requests.
In practice, most requests are short and fit into a single chunk. And prost
doesn’t need a Vec
, using the chunk directly would be fine. The type of the
chunk is Bytes
, which implements Buf
, which is what the parser wants. So
we could somehow check if the body is indeed in a single chunk and have two
branches of the code, one that uses it directly and the other handling the
longer bodies with a Vec
. But that would make the code harder to read.
We could at least reuse the Vec
between requests and not allocate every time.
But we would still be copying the bytes around.
The Buf
trait allows implementations to be non-continuous. That is, it
doesn’t have to be one big slice, it can be a sequence of shorter ones. So if we
could instead just chain the chunks and create a one Buf
that represents
their sequence, everything would be nice and shiny.
The aggregate function in hyper actually does pretty much that, but it has three disadvantages:
- It returns
impl Buf
instead of concrete type. hyper in general has the habit of not wanting to give me concrete types, which makes it harder to pass the values around or store them (it pollutes code with generics, requires dynamic dispatch and boxing in traits ‒ which is another allocation I don’t want to have). - I want to have that defensive check of how much data I’ve received from the
client and abort it before all RAM is consumed. It seems the aggregate will
just consume everything, no matter how large the payload is. Checking some
kind of
Content-Length
header first doesn’t help, because with a chunked encoding it is not mandatory. If one uses aggregate in the server, a malicious client can just kill the server by sending too much data. - It doesn’t provide the other optimization I describe below at the moment (I’ve went to check the code). Though I might submit a PR to fix this, it should be quite simple.
For the above reasons, I’ve decided to roll my own type that is basically a
bunch of Buf
s chained together (well, actually, I have two slightly
different ones) and publish it in the bytes-utils crate. See the
SegmentedSlice
, which can be built on top of some SmallVec
of chunks.
That way if only few chunks arrive (we expect just one), we don’t do any extra
allocation on top the one that hyper does to create the chunks.
let mut chunks = SmallVec::<[Bytes; 4]>::new();
let mut body = req.into_body();
while let Some(chunk) = body.next().await {
let chunk = chunk?;
if data.len() + chunk.len() > limits.max_bytes {
return Err(TooLarge.into());
}
chunks.push(chunk);
}
// This one doesn't allocate and will present a Buf view into all the chunks
// we accumulated.
let data = SegmentedSlice::new(&mut chunks);
// And we can pass that do prost directly.
let decoded = Batch::decode(data)?;
Allocations to fill the protobuf messages
By default, the structures generated by prost will look something like this (I’m not including bunch of attributes it puts on them, they are not interesting right now):
struct Entry {
name: Option<String>,
phone: Option<String>,
address: Option<String>,
}
struct Batch {
entries: Vec<Entry>,
}
Each non-empty Vec
or String
is another separate allocation. In the case of
the strings that come from the input, they are just copied from there. In theory
prost could also borrow from the input data ‒ do something like this:
struct Entry<'a> {
name: Option<&'a str>,
...
}
Then it could just point into the input data instead of allocating and copying. But that would make working with the structures and the whole API more hairy, so the suggestion to allow that was turned down.
Nevertheless, a different feature is in progress. At the bottom, there’s the
copy_to_bytes
method from the bytes crate. If the type this is called on
is Bytes
, it’ll not allocate new instance and copy the bytes. The Bytes
type is something like Arc<[u8]>
on steroids and it can hand out owned
sub-slices of itself (increment the reference count, but point to subset of the
original allocation). So Bytes::copy_to_bytes
will just subslice itself and
hand out that without copying. That’s pretty cool.
And prost can take advantage of it, with the bytes
configuration method. For now this works only for bytes
fields in the
protobuf, but it seems some form of support is coming for strings too.
OK, so if we fast-forward to some future release of prost, we will likely get this struct instead (if we explicitly ask for it):
struct Entry {
name: Option<Bytes>,
...
}
Then, when parsing, it’ll call this copy_to_bytes
method. If the original
input is something like &[u8]
, it’ll just copy the bytes into a new
allocation, but if the original input is also Bytes
, it’ll just slice &
increment the reference count instead, which is cheap. Almost like the original
idea with borrowing from the input, but without any annoying lifetimes around.
But, isn’t our input some kind of SegmentedSlice
? It is, but I’ve taken the
care to also support the same optimization, at least where possible. If the
whole request (one call to copy_to_bytes
) can be fullfilled by a single
buffer, then it delegates to that one. If the underlying buffers are Bytes
(like in our example above), it’ll use their optimized version. Unfortunately,
if the request is across a buffer boundary, then there’s no option but to
allocate & copy. However, given that we usually expect to have just one chunk
(so there aren’t multiple buffers to start with) and even if there are, each
boundary can “kill” only one string/byte string, we have improved the situation
by a lot.
But String
is more comfortable to use than Bytes
If the generator can create structures with Bytes
in it, it helps the
allocations. But that loses all the fancy string manipulation methods, like
lines
, because it deals with arbitrary byte arrays.
For that reason, the crate I’ve publish also contains the Str
type. It’s a
thin wrapper around the Bytes
type, but dereferences to &str
and puts all
the methods back.
Furthermore, it adds few methods of its own. Many of the string methods return
sub-slices of &str
. But these are borrowed and therefore introduce lifetimes
into whatever structures they are put into. So there are few analogue methods
(like lines_bytes
) that return more instances of Str
instead of &str
.
These are still cheap to make (slightly less cheap than the &str
, but it is
only one reference count away) while providing an owned value.
I don’t know if prost will eventually allow some way to fully customize which
types it should generate ‒ that would, additionally, allow using some SmallVec
instead of Vec
for repeated
fields, to save allocations in case only few
values are present. If it would, it should be possible to have structs with
name: Option<Str>
in them. If not, it is still possible to create the Str
from Bytes
, even though that would be less convenient.
Does it help?
Since the part where prost replaces String
with Bytes
isn’t available yet,
I haven’t integrated this into the service. So I don’t know if it makes it in
any way measurably faster. But considering the processing part, after it is
received and parsed, is probably much larger than this, I don’t think it’ll be a
big improvement. But it’ll certainly make my itching about number of allocations
get better 😇. After all, who does performance optimizations because of actual
speed?
That being said, this isn’t necessarily an obvious win. The bytes crate is a
treasure box of various little clever tricks (if you use the crate, I really
recommend reading through its documentation). One of them is, if you use the
BytesMut
type to first generate the data (for example by reading them from a
socket) and then chop off the generated buffer in the form of Bytes
, you can
keep reading into the same BytesMut
. Under some circumstances ‒ like, if all
instances of the Bytes
that was handed out were dropped and you call
reserve
, the original, chopped-off part of buffer is reclaimed and can be
reused. So in the old code, if we keep the Vec
around, or if it grows
exponentially and has enough capacity most of the time, the bytes are copied
over, but a new allocation is not done, the same buffer is used. In the new way,
we keep storing the instances of Bytes
around, so they can’t be reused. So
every now and then the BytesMut
buffer runs out and a new one needs to be
allocated. However, if we expect only one chunk to arrive, this doesn’t happen
(there’s no chance to reuse anything).
Furthermore, if there was a big buffer to start with, and we have sliced a small
Bytes
slice out of it and keep it around, that keeps the whole original big
buffer allocated. This could bite us if we actually store for example one small
string out of the whole request in some kind of semi-permanent storage instead
of throwing everything out after processing. That way our memory consumption
would grow by all that unused but not returned buffer space. It’s not really a
memory leak in the strict sense ‒ there still is a pointer to the original
buffer and if we drop the small string, it is properly deallocated. But it would
waste the space nevertheless.
So I think using Bytes
and the utilities in the bytes-utils
crate can be
a double-edged sword. There are situations where it (at least theoretically)
could help when used right. But there are also situations where it could hurt.