Serialization and deserialization of large messages can be slow #166

Closed

spenczar opened this issue Mar 25, 2019 · 10 comments

@spenczar
Contributor

As mentioned in #165, very large messages can slow down a Twirp server due to serialization and deserialization costs.

More detail is needed here: what are the profiling hotspots? It sounds like byte slice allocation is a big one; if so, memory pooling could be useful.

@vmg

vmg commented Mar 26, 2019

Morning! So, following up on the other PR.

Problem Statement

The rough problem statement is: when working with a Twirp server in Go that handles very large ProtoBuf messages, there is significant overhead in the serialization and de-serialization code.

This overhead comes from two main sources:

// Overhead source 1: reading the whole body into a byte slice.
buf, err := ioutil.ReadAll(req.Body)
if err != nil {
	err = wrapErr(err, "failed to read request body")
	s.writeError(ctx, resp, twirp.InternalErrorWith(err))
	return
}
// Overhead source 2: parsing that byte slice into the request message.
reqContent := new(Size)
if err = proto.Unmarshal(buf, reqContent); err != nil {
	err = wrapErr(err, "failed to parse request proto")
	s.writeError(ctx, resp, twirp.InternalErrorWith(err))
	return
}

This deserialization code is the hottest path in both memory and CPU profiles. This is because we're using ioutil.ReadAll, which basically wraps bytes.NewBuffer & (*bytes.Buffer).ReadFrom. We're always allocating large amounts of memory, and because we do not know the total size of the incoming object beforehand, the backing slice in the Buffer has to be reallocated several times on average for every request. This causes a lot of churn, making line 269 the top allocation in a normal request path.
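
For reference, this is roughly what ioutil.ReadAll does under the hood (a simplified, illustrative sketch of the stdlib code path of the time; readAllSketch is not a real function):

package sketch

import (
	"bytes"
	"io"
)

// readAllSketch mimics ioutil.ReadAll: an almost-empty bytes.Buffer whose
// backing slice ReadFrom has to re-grow (and copy) repeatedly, because the
// final size of the body is unknown up front.
func readAllSketch(r io.Reader) ([]byte, error) {
	var buf bytes.Buffer
	buf.Grow(bytes.MinRead)   // only 512 bytes to start with
	_, err := buf.ReadFrom(r) // every growth re-allocates and memmoves what was already read
	return buf.Bytes(), err
}

Every one of those re-allocations shows up as scratch memory in the profiles below.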

Let's take a look at a couple of memory profiles. Here, GoogleMessage1 is a complex ProtoBuf type from Google's benchmarks, and Bench2Request is an object with large buffers from one of our internal services:

    508            .     1.16GB           	buf, err := ioutil.ReadAll(req.Body) 
    509            .          .           	if err != nil { 
    510            .          .           		err = wrapErr(err, "failed to read request body") 
    511            .          .           		s.writeError(ctx, resp, twirp.InternalErrorWith(err)) 
    512            .          .           		return 
    513            .          .           	} 
    514      53.02MB    53.02MB           	reqContent := new(GoogleMessage1Gogo) 
    515            .   193.02MB           	if err = proto.Unmarshal(buf, reqContent); err != nil { 
    516            .          .           		err = wrapErr(err, "failed to parse request proto") 
    517            .          .           		s.writeError(ctx, resp, twirp.InternalErrorWith(err)) 
    518            .          .           		return 
    519            .          .           	} 
    520            .          .            
    796            .     2.35GB           	buf, err := ioutil.ReadAll(req.Body) 
    797            .          .           	if err != nil { 
    798            .          .           		err = wrapErr(err, "failed to read request body") 
    799            .          .           		s.writeError(ctx, resp, twirp.InternalErrorWith(err)) 
    800            .          .           		return 
    801            .          .           	} 
    802            .          .           	reqContent := new(Bench2RequestGogo) 
    803            .     1.15GB           	if err = proto.Unmarshal(buf, reqContent); err != nil { 
    804            .          .           		err = wrapErr(err, "failed to parse request proto") 
    805            .          .           		s.writeError(ctx, resp, twirp.InternalErrorWith(err)) 
    806            .          .           		return 
    807            .          .           	} 

The most interesting thing here is that, because of the way bytes.Buffer grows for each request, we allocate significantly more memory in temporary scratch buffers to parse an object than in the actual parsing! This is particularly noticeable in the complex ProtoBuf object from Google, where most of the fields are embedded in the GoogleMessage1Gogo struct: we allocate 1.16GB in buffers to parse 250MB of objects. That's a lot of churn!

The result of that churn can clearly be seen in the CPU profile graph:

 flat  flat%   sum%        cum   cum%
    1180ms 24.84% 24.84%     1180ms 24.84%  runtime.memmove // Scratch allocation
     950ms 20.00% 44.84%      950ms 20.00%  unicode/utf8.ValidString // ProtoBuf parsing
     580ms 12.21% 57.05%      580ms 12.21%  runtime.memclrNoHeapPointers // Scratch allocation
     270ms  5.68% 62.74%     1690ms 35.58%  github.com/gogo/protobuf/proto.appendUTF8StringSlice // ProtoBuf parsing
     250ms  5.26% 68.00%      320ms  6.74%  github.com/github/twirp-bench/pkg/pb.(*HighlightedGogo).Size
     240ms  5.05% 73.05%      240ms  5.05%  github.com/gogo/protobuf/proto.appendVarint
     150ms  3.16% 76.21%      720ms 15.16%  github.com/github/twirp-bench/pkg/pb.(*HighlightedGogo).MarshalTo
     150ms  3.16% 79.37%      380ms  8.00%  runtime.scanobject
     110ms  2.32% 81.68%      130ms  2.74%  runtime.findObject

GC + memory allocation simply overwhelms the actual parsing time of the ProtoBuf objects in synthetic benchmarks. In a production service benchmark that performs operations on the buffers, GC + memory allocation is no longer number 1, but it still dominates most of the top 10. runtime.memmove, caused by re-allocating the buffers while parsing, shows up between number 2 (!!!) and number 5 depending on the message type. I think that's a problem!

Possible Solutions

Let's enumerate the possible solutions which I've considered and which @spenczar proposed in the previous PR.

  • Do nothing: we could just wait until the Go GC improves significantly and the overhead of all this memory allocation is reduced. I'm not sure how long that is going to take, and I'm not sure how good the results will be, because as seen in the profiles, most of the churn is caused by re-allocating buffers, not by GCing them.

  • User-supplied memory pools: this is the approach that gorilla/websocket took, and the one I implemented in #165 (Add optional support for memory pooling). I think this is the lowest-effort change that produces a significant improvement in performance when dealing with large messages. It has some shortcomings:

    • We are asking users to explicitly opt in to this feature, which may not be the ideal ergonomics.
    • The total memory usage of the Twirp service is slightly higher than without those changes, because we're always pooling the largest buffers we've allocated. Although total memory usage (RSS) is higher, the actual allocations per request are minimized or removed completely. This is the tradeoff of memory pooling, and some users may not want to make it.
    • It's not as smart as it could be, again because we're always pooling the largest objects.
  • Keeping track of typical request/response sizes inside the server and using buffer pools that are hidden entirely: this is an interesting approach. The Go stdlib does a similar thing for HTTP/2 request pooling: https://github.com/golang/go/blob/7e394a2/src/net/http/h2_bundle.go#L998-L1043

    They have a huge advantage, which is that HTTP/2 frames have a maximum possible size, so they can chunk the pooling into a reasonable range. The messages in a Twirp service can have arbitrary sizes, so coming up with a default set of buckets sounds much harder -- unless we detect the different bucket sizes at runtime. I'm not sure how expensive and/or complex that would be.

  • Pooling proto.Buffer objects: As suggested by @spenczar in the other PR, I'm not quite sure of the advantage of this approach. proto.Buffer objects are a simple wrapper over a []byte slice; they can be used to serialize ProtoBuf objects, but we cannot de-serialize straight from the network with them, because we have no efficient way to fill them with the contents of the request body. proto.NewBuffer(bytes).Unmarshal() is what one would do to de-serialize, but in order to read the bytes from the HTTP request we would still need e.g. a bytes.Buffer, which has the same memory-churn issue as before. That is why I decided to pool []byte slices in my PR and then use them to bootstrap either bytes.Buffer or proto.Buffer objects for (de)serialization: it's the lowest common denominator (see the sketch after this list).

  • Implementing a streaming deserializer: The ProtoBuf authors at Google are against this approach; they don't think it's going to improve performance in any meaningful way. They're probably right! Streaming is complex and expensive!

    In fact, what they suggest instead is, well, what I implemented in #165 (Add optional support for memory pooling); see golang/protobuf#507 (proto: add streaming APIs for Unmarshal/Marshal).
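
To make the []byte-pooling idea above concrete, here is a rough sketch of the shape of that approach (not the actual #165 diff; bufPool and unmarshalPooled are illustrative names, and it assumes the unmarshaler does not retain the input slice):

package sketch

import (
	"bytes"
	"io"
	"sync"

	"github.com/golang/protobuf/proto"
)

// bufPool holds plain []byte slices; the same slice can back a bytes.Buffer
// for reading the request body or a proto.Buffer for writing the response.
var bufPool = sync.Pool{
	New: func() interface{} { return make([]byte, 0, 4096) },
}

// unmarshalPooled reads the whole body into a pooled slice, parses it, and
// returns the (possibly grown) slice to the pool for the next request.
func unmarshalPooled(body io.Reader, msg proto.Message) error {
	buf := bytes.NewBuffer(bufPool.Get().([]byte)[:0])
	defer func() { bufPool.Put(buf.Bytes()[:0]) }()

	if _, err := buf.ReadFrom(body); err != nil {
		return err
	}
	return proto.Unmarshal(buf.Bytes(), msg)
}

A real implementation would probably pool *[]byte (or a small wrapper struct) rather than bare slices to avoid boxing allocations on Put, but the shape is the same.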

I think that's a good summary of where we're at! Are we missing something? From my point of view, large messages are clearly a performance issue because of the (re)allocations, and the least painful way to improve this bottleneck is memory pooling. The biggest open questions are whether we should make this pooling opt-in or enabled by default, how we're going to do the pooling, and what kind of objects we're going to pool.

I would love to hear your thoughts on this!

@spenczar
Contributor Author

spenczar commented Mar 26, 2019

Thanks, terrific write-up and just what I was hoping for.

I'd like to set some additional boundaries on this problem:

  • Any solution needs to be backwards-compatible.
  • Optional configuration is evil; we should set things for everyone if at all possible. Sometimes evil is unavoidable, but we should try.
  • Performance for the high-RPS, small-message-size case must not be negatively impacted.

I have an additional solution to consider:

Content-Length Header: Since the issue seems to be mostly around the way ioutil.ReadAll works, I wonder if we can do something simpler. If the client sets the Content-Length header (and if we can trust the client!), then we can allocate a slice of exactly the right size, rather than growing the byte slice over and over as ioutil.ReadAll does.

This would mean there's just one allocation per unmarshal. However, Content-Length is spoofable; we'd have to be dubious in case the client asks us to allocate 10GB or whatever. Is there a safe level we can allocate up to, like 10MB max?
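
For what it's worth, a minimal sketch of that idea (maxTrustedLength and readBody are hypothetical names; anything missing or above the cap falls back to the current ioutil.ReadAll path):

package sketch

import (
	"io"
	"io/ioutil"
	"net/http"
)

// maxTrustedLength caps how much we are willing to pre-allocate based on the
// client-supplied header.
const maxTrustedLength = 10 << 20 // 10MB

func readBody(req *http.Request) ([]byte, error) {
	n := req.ContentLength
	if n <= 0 || n > maxTrustedLength {
		return ioutil.ReadAll(req.Body) // unknown or untrusted size: old path
	}
	buf := make([]byte, n) // single allocation of exactly the advertised size
	if _, err := io.ReadFull(req.Body, buf); err != nil {
		return nil, err // includes io.ErrUnexpectedEOF if the body is shorter than claimed
	}
	return buf, nil
}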


I'd be very wary of reaching too quickly for sync.Pool:

First, it can have non-linear and unexpected negative consequences for garbage collection performance. golang/go#22950 is a good summary of the problem, and golang/go#14812 (comment) describes a very similar application of pools that ends up reducing a service's throughput. You gain a little from reducing allocations but may pay a lot in GC assist CPU usage, and can pay through the nose due to the way pool cleanup is implemented. This is a long-term issue that has been known since 2013 and doesn't look likely to be resolved any time soon.

Second, variable-sized workloads can be very bad for sync.Pool's performance, but different RPCs will have different-sized requests and responses (and, indeed, clients may send wildly differently-sized messages for application-specific reasons). @dsnet has summarized the problem in golang/go#23199; since the pool returns a random object, a few large requests can "poison" a pool and make the heap get really big (and introduce GC pressure that slows down your application!). However, we might be able to address this by using many fixed-size buffers in the pool and pulling them out as needed (like [512]byte arrays, so to read a 4kB message you'd call Get() 8 times). A rough sketch of that chunked approach is at the end of this comment.

Third, I am worried about lock contention on high-RPS services. Today, sync.Pool involves a global mutex over all pools being used as well as per-P local locks. For services that handle many requests, lock contention is a serious concern. https://go-review.googlesource.com/c/go/+/166960/ might help address this, but I wouldn't feel great about depending upon it landing.

These aren't necessarily dealbreakers, but they're worrying. I think we need to have profiles and/or benchmarks of different load patterns before we can say whether memory pools are actually a net win for most bottlenecked services.
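
For the fixed-size-chunk idea in the second point, a rough sketch of what that could look like (chunkPool and readChunked are hypothetical names; the final join still allocates one exact-size slice, since proto.Unmarshal needs contiguous bytes, but the repeated growth-and-copy churn goes away):

package sketch

import (
	"io"
	"sync"
)

// chunkPool only ever holds fixed 512-byte blocks, so a single huge request
// cannot poison the pool with an oversized buffer.
var chunkPool = sync.Pool{
	New: func() interface{} { return new([512]byte) },
}

// readChunked reads r into pooled 512B chunks, then joins them into a single
// []byte of exactly the right size.
func readChunked(r io.Reader) ([]byte, error) {
	var (
		chunks []*[512]byte
		sizes  []int
		total  int
	)
	defer func() {
		// Return every used chunk to the pool once we are done with it.
		for _, c := range chunks {
			chunkPool.Put(c)
		}
	}()
	for {
		c := chunkPool.Get().(*[512]byte)
		n, err := io.ReadFull(r, c[:])
		if n > 0 {
			chunks = append(chunks, c)
			sizes = append(sizes, n)
			total += n
		} else {
			chunkPool.Put(c)
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			break // end of body; the last chunk may be partial
		}
		if err != nil {
			return nil, err
		}
	}
	out := make([]byte, 0, total)
	for i, c := range chunks {
		out = append(out, c[:sizes[i]]...)
	}
	return out, nil
}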

@dsnet

dsnet commented Mar 26, 2019

This would mean there's just one allocation per unmarshal. However, Content-Length is spoofable; we'd have to be dubious in case the client asks us to allocate 10GB or whatever. Is there a safe level we can allocate up to, like 10MB max?

Even a max of 10MiB is difficult, since it means that an adversary only needs to send 100 small requests, each claiming to be 10MiB, to make the server allocate 1GiB. An important principle of DoS protection is that the attacker has to spend resources at least proportional to what they are causing you to waste.

The implementation of Go protobufs is unique in that we operate on a single []byte, which fundamentally forces users to answer the difficult question of how large a buffer they need to hold all of it. I'm increasingly convinced that we need to add scattered read/write APIs for protobufs (see golang/protobuf#609). Having an API like that means that the user doesn't have to answer the question about up-front buffer size: they just collect a series of buffer chunks, allocating as necessary, without needing to discard previous work.

I should note that protobufs for C++ and Java operate on scatter/gather like APIs for buffers.

@spenczar
Contributor Author

spenczar commented Mar 26, 2019

Yeah, @dsnet, you have me convinced that trusting Content-Length is probably a dead-end. Doesn't this problem exist to some extent for any protobuf-based protocol, since strings and bytes and embedded messages are length-prefixed? How does the github.com/golang/protobuf/proto unmarshaler avoid allocating 1GiB when a (possibly untrusted) message promises a 1GiB string, but never actually delivers it?

golang/protobuf#609 sounds like a nice strategy, but if @dsnet doesn't want to move forward on it until readv support arrives (golang/go#17607), we could be waiting a long time - and even then, it would depend upon the bleeding edge of the standard library; Twirp promises compatibility with the last two minor versions of Go, so we would be at least a year away from being able to use it, if readv support is a blocker.

@dfawley's suggestion in golang/protobuf#609 (comment) is what I had in mind when I suggested a streaming deserializer.

Man, we don't have a lot of options here. The most promising option might be to use pools with fixed (large-ish) buffer sizes and go from there.

@dsnet

dsnet commented Mar 26, 2019

How does the github.com/golang/protobuf/proto unmarshaler avoid allocating 1GiB when a (possibly untrusted) message promises a 1GiB string, but never actually delivers it?

The fact that protobuf takes a []byte implies that the user had to do that work for us (which is precisely what this issue is about, heh). So, when we're parsing a length-prefix, if the length says something larger than the remainder of the []byte input, then clearly something is wrong.
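
In other words, the guard is essentially this (a trivial sketch; checkLen is just an illustrative name):

package sketch

import "io"

// checkLen rejects a declared length-prefix that claims more bytes than the
// remainder of the already-read input actually contains.
func checkLen(declared uint64, rest []byte) error {
	if declared > uint64(len(rest)) {
		return io.ErrUnexpectedEOF // the message promised bytes it never delivered
	}
	return nil
}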

@dsnet doesn't want to move forward on it until readv support arrives (golang/go#17607)

It's possible that we move forward without it, or push more heavily to get it added in Go1.13 or Go1.14 (no promises). Alternative APIs that don't involve net.Buffers are also possible. @neild and I were talking about the benefits of scatter/gather APIs at lunch today.

@vmg

vmg commented Apr 3, 2019

Back on this thread. Sorry, I've been busy! 😅

I am not particularly opposed to Content-Length because we're using Twirp in a trusted environment, but I do see how it can lead to serious DoS issues and maybe we should just discard it as an option.

It seems like our only realistic options are waiting for golang/protobuf#609 (which may take a while) or writing a smarter, self-adjusting memory pool that is enabled by default and works well with small messages. I think a smarter pool is doable and will provide solid benefits right away. @spenczar: would you like me to submit something like that as a PR?

@spenczar
Contributor Author

spenczar commented Apr 3, 2019

@vmg A smarter pool could work. I'd be interested in a design for one, yes. If you think it's easiest to talk over a design in PR form, that's fine by me; discussion in this issue is good too.

@vmg

vmg commented Apr 4, 2019

What I had in mind is something similar to the pool that fasthttp uses: https://github.com/valyala/bytebufferpool/blob/master/pool.go

I think we would need to adapt that code rather than use it directly, because we would want to pool proto.Buffer objects instead of their custom buffer type, but the approach of calibrating the size of the buffers in the pool based on recent request sizes seems efficient and sensible.
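
Very roughly, something in this direction (a heavily simplified sketch with hypothetical names; the real bytebufferpool tracks size classes and recalibrates periodically, which this does not do):

package sketch

import (
	"sync"
	"sync/atomic"

	"github.com/golang/protobuf/proto"
)

// calibratedPool hands out proto.Buffer objects whose starting capacity is
// adjusted from the sizes of recently returned buffers, and refuses to pool
// buffers above a hard cap so one huge message cannot inflate the pool.
type calibratedPool struct {
	pool        sync.Pool
	defaultSize int64 // starting capacity for newly created buffers
	maxSize     int64 // buffers larger than this are dropped, not pooled
}

func (p *calibratedPool) Get() *proto.Buffer {
	if b, ok := p.pool.Get().(*proto.Buffer); ok {
		return b
	}
	return proto.NewBuffer(make([]byte, 0, atomic.LoadInt64(&p.defaultSize)))
}

func (p *calibratedPool) Put(b *proto.Buffer) {
	size := int64(cap(b.Bytes()))
	if size > atomic.LoadInt64(&p.maxSize) {
		return // too big: let the GC reclaim it
	}
	// Crude calibration: nudge the default towards the sizes we actually see.
	if size > atomic.LoadInt64(&p.defaultSize) {
		atomic.StoreInt64(&p.defaultSize, size)
	}
	b.Reset()
	p.pool.Put(b)
}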

What do you think?

@spenczar
Contributor Author

Sorry for the long silence here.

That code looks alright, but it's fairly complicated. It adds a bunch of atomic calls, which could require synchronization during high-throughput work. I think we should go carefully. Can we gather real-world, production benchmarks from a version of Twirp that uses that pool before committing to it?

@github-actions

github-actions bot commented Dec 3, 2020

This issue is stale because it has been open 60 days with no activity. Remove stale label / comment or this will be closed in 5 days
