
Proposal: make it possible to observe chunk delimiters. #19

Closed
Lukasa opened this issue Nov 1, 2016 · 7 comments

Comments


Lukasa commented Nov 1, 2016

Motivation: urllib3's stream method allows users to request that they be streamed each chunk as it arrives. This is very difficult to do with h11 as it currently stands because, while h11 will try to emit one Data event per chunk, if the buffer contains a partial chunk h11 will prefer to emit that Data event and empty the buffer rather than sit on it.

This is an entirely defensible design decision: while urllib3's users seem to want to be able to receive the chunks as they come in, chunk delimiters are not supposed to be semantic. However, for better or worse there are some use-cases where it is very helpful to know where chunk delimiters are.

There are three ways I can see of doing this:

  1. Change h11's behaviour to emit NEED_DATA when a partial chunk is in the buffer, rather than a Data event for that partial chunk. This is probably inefficient in the case where people don't care about the chunk sizes, and also allows for pernicious behaviour where the user just keeps shoving data into h11's buffer without h11 ever being able to emit it.

    (I should note that this is basically what h2 does with DATA frames: it emits one DataReceived event per frame. This is less problematic for h2 because of SETTINGS_MAX_FRAME_SIZE, which limits the total memory cost of buffering an entire frame.)

  2. Add a flag to swap between the current mode and the mode described in (1), which defaults to the current mode. I think this is a bad idea, but I did want to bring it up for completeness' sake. This has all the downsides of (1) plus an extra bit of interface complexity and testing surface to go with it. Not recommended.

  3. Add a flag to Data events that signals whether they mark the completed end of a chunk; otherwise keep the current behaviour the same. This would allow tools like urllib3 that want to know where the chunk boundaries are to simply run a tight loop on recv() until they see a Data event with end_chunk=True. Because of h11's current semantics, any prior Data events that don't have that flag set are part of the same chunk as the one that does, and any subsequent Data events are part of a new chunk.

    This has the advantage of being the smallest logical change: it's likely easy and performant to implement, and it is extremely unobtrusive to users that don't care about this concept. Altogether I think this is the best of the three possibilities in terms of giving tools that care about this (and, to be clear: as much as possible, tools should try not to care about this) the ability to get what they need, while keeping that unusual use-case as far away from affecting other users as possible.
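    To make option (3) concrete, here is a sketch of the tight loop a consumer like urllib3 could run. Everything here is hypothetical: end_chunk is the flag being proposed, not an existing h11 attribute, and the Data class below is a plain stand-in rather than h11's real event class.

    ```python
    from dataclasses import dataclass

    @dataclass
    class Data:
        # Stand-in for h11's Data event; end_chunk is the flag proposed
        # in option (3), not something h11 currently provides.
        data: bytes
        end_chunk: bool = False

    def read_one_chunk(events):
        """Accumulate Data events until one is flagged as ending a chunk."""
        parts = []
        for ev in events:
            parts.append(ev.data)
            if ev.end_chunk:
                break
        return b"".join(parts)

    # Three partial reads that together make up a single encoded chunk:
    events = iter([Data(b"he"), Data(b"llo "), Data(b"world", end_chunk=True)])
    print(read_one_chunk(events))  # b'hello world'
    ```

    In real code the event stream would come from repeated next_event()/recv() calls, but the reassembly logic is no more complicated than this.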

Thoughts?


njsmith commented Nov 1, 2016

So, my first response is going to be to push back and ask if urllib3 really needs this? ...Are you sure? :-)

Looking at the docs, I can't see any mention of this "give me one encoded chunk at a time" behavior, or chunk boundaries at all? Indeed it looks like the stream method takes a mandatory max-size argument which must mean that stream silently chops at least large chunks into multiple pieces, no? Or am I looking in the wrong place?

Also note that proxies are explicitly allowed to mangle chunk boundaries ("The chunked encoding is specific to each connection and is likely to be removed or recoded by each recipient (including intermediaries)" -- RFC 7230), so even if the API is there then this can hardly be reliable in the general internet...

That said...

allows for pernicious behaviour where the user just keeps shoving data into h11's buffer without h11 ever being able to emit it.

Right, h11 guarantees that it never uses unbounded memory, so option (1) is definitely out. And I don't like (2) for the same reason -- a bounded-memory guarantee isn't very useful if the first thing popular wrappers like urllib3 do is disable it. (Presumably urllib3 will do whatever it's going to do even for callers who don't care about chunk boundaries, i.e. all of them.) Also, I don't see how this would allow stream to implement the amt argument anyway?

Add a flag to Data events that signal whether they mark the completed end of a chunk

If we're going to do this, then probably the thing to do would be to bite the bullet and start supporting chunk extensions at the same time. These are an extremely obscure feature where each chunk can have some arbitrary key/value metadata attached to it. Right now we just throw away this metadata (which is a totally compliant option -- even proxies are allowed to do this!). Since this metadata appears at the beginning of chunks, probably the thing to do would be, instead of flagging the end of chunks, to add a field to Data that means something like "this is the beginning of a new chunk, and here's the chunk extension metadata".
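A sketch of what that alternative event shape might look like (all hypothetical -- h11's real Data event carries neither field at this point, and the names are made up for illustration):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Data:
    # Hypothetical event shape: chunk_start flags the first Data event
    # of a new chunk, and chunk_extensions carries the (normally
    # discarded) key/value metadata from the chunk-size line.
    data: bytes
    chunk_start: bool = False
    chunk_extensions: Optional[dict] = None

# A wire chunk like b"5;trace=abc\r\nhello\r\n" might then surface as:
ev = Data(b"hello", chunk_start=True, chunk_extensions={b"trace": b"abc"})
```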

But again, I'd rather not do it at all :-)


Lukasa commented Nov 1, 2016

Looking at the docs, I can't see any mention of this "give me one encoded chunk at a time" behavior, or chunk boundaries at all? Indeed it looks like the stream method takes a mandatory max-size argument which must mean that stream silently chops at least large chunks into multiple pieces, no? Or am I looking in the wrong place?

You're misreading that doc. amt is not mandatory, it just has a default value. And if that value is overridden to None, it will stream in chunk sizes.

The relevant work-trail is kennethreitz/requests#2449 and urllib3/urllib3#550, and the documentation for the feature while not present in urllib3 is present in the Requests documentation.

So while I note that the feature cannot be reliable (indeed, I basically said as much above), that doesn't seem to me to be a compelling reason not to surface the metadata and allow implementations to make the decision about whether to care. In this case, requests and urllib3 received feature requests to allow this feature, which were implemented, and that means that we cannot remove the feature short of a full deprecation cycle. That turns out to be a pain in the neck.

As to your suggestion about supporting chunk extensions: I suspect that you really don't need to do that. Certainly, if it's on your list of features you'd like to support then by all means while we're messing about in the code it's worth adding. But you already have the data I'm asking for: you're just choosing to throw it away, where I'd like it surfaced to the user. Getting chunk extensions requires additional parsing code, which rather changes the workload of the feature. ;)


Lukasa commented Nov 1, 2016

While I'm here, let me ask a follow-on. If the position of h11 were that chunk boundaries are non-semantic and cannot be relied upon (a reasonable and intellectually consistent position, by the way), it seems odd that the Data event would reflect the boundaries at all. Why not just eagerly parse the entire stream and return a single Data event?


njsmith commented Nov 1, 2016

In this case, requests and urllib3 received feature requests to allow this feature which were implemented, and that means that we cannot remove the feature short of a full deprecation cycle.

Reading that history, it's not clear to me that the user even cared about chunk boundaries really -- it sorta looks like the amt=None feature snuck in alongside the implementation of chunked decoding in general, without anyone noticing that this sub-feature was arguably a bad idea and pushing back. (And the referenced blog post seems dangerously confused about the extent to which they're relying on undefined behavior.) Oops. But I guess now it is what it is. Ugh. Fine.

(It might make sense in any case to update the requests docs to emphasize that one really should set chunk_size=<int> and maybe even outright deprecate chunk_size=None? It's not a huge thing but recommending people do something that's unreliable and might trigger unbounded memory use seems suboptimal.)

As to your suggestion about supporting chunk extensions: I suspect that you really don't need to do that. Certainly, if it's on your list of features you'd like to support then by all means while we're messing about in the code it's worth adding. But you already have the data I'm asking for: you're just choosing to throw it away, where I'd like it surfaced to the user. Getting chunk extensions requires additional parsing code, which rather changes the workload of the feature. ;)

Well, parsing them isn't terribly hard -- just a regexp match, and we're already isolating them from the stream to discard them. But yeah, it's not crucial. My point was more that, given there is this other feature that might make sense to support, and that other feature prefers treating chunk starts specially instead of chunk ends, we should probably follow that here as well. (AFAICT if the goal is just to detect chunk boundaries then either approach is equally good, right?)
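For illustration, here's what the "just a regexp match" could look like. This is a hedged sketch, not h11's code: it parses a chunk-size line per the RFC 7230 grammar but skips quoted-string extension values for brevity, and none of these names come from h11.

```python
import re

# Matches a chunk-size line such as b"1a;name=value;flag\r\n":
# hex size, then zero or more ;name[=value] extensions, then CRLF.
CHUNK_HEADER = re.compile(rb"^(?P<size>[0-9A-Fa-f]+)(?P<ext>(?:;[^\r\n]*)?)\r\n")

def parse_chunk_header(line: bytes):
    m = CHUNK_HEADER.match(line)
    if m is None:
        raise ValueError("malformed chunk-size line")
    size = int(m.group("size"), 16)
    extensions = {}
    ext = m.group("ext")
    if ext:
        for item in ext.lstrip(b";").split(b";"):
            name, _, value = item.partition(b"=")
            # Extensions without a value (e.g. b"flag") map to None.
            extensions[name.strip()] = value.strip() or None
    return size, extensions

print(parse_chunk_header(b"1a;name=value;flag\r\n"))
# (26, {b'name': b'value', b'flag': None})
```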

Why not just eagerly parse the entire stream and return a single Data event?

Because the "read a (partial) chunk" state machine was much simpler than the "read one or more complete chunks + a trailing partial". The current parser structure is stupid-simple and I don't like writing code when I can skip it :-).


Lukasa commented Nov 1, 2016

Oops. But I guess now it is what it is. Ugh. Fine.

This collection of sentences describes like 90% of my opinions about software.

(AFAICT if the goal is just to detect chunk boundaries then either approach is equally good, right?)

Yep, I certainly have no objection to reversing the logic. It requires a bit more out-of-h11 buffering but urllib3 is going to have to do loads of out-of-h11 buffering anyway so we can probably tolerate it.
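With start-of-chunk flagging, the extra out-of-h11 buffering amounts to a small generator like this (a sketch: chunk_start is the hypothetical field under discussion, and the Data class is a plain stand-in for h11's event):

```python
from dataclasses import dataclass

@dataclass
class Data:
    # Stand-in for an h11 Data event with the hypothetical
    # chunk_start flag (start-of-chunk marking).
    data: bytes
    chunk_start: bool = False

def iter_chunks(events):
    """Yield one reassembled chunk per chunk_start boundary."""
    parts = []
    for ev in events:
        if ev.chunk_start and parts:
            yield b"".join(parts)  # flush the previous chunk
            parts = []
        parts.append(ev.data)
    if parts:
        yield b"".join(parts)  # trailing chunk at end of stream

events = [
    Data(b"he", chunk_start=True), Data(b"llo"),
    Data(b"world", chunk_start=True),
]
print(list(iter_chunks(events)))  # [b'hello', b'world']
```

The trade-off mentioned above is visible here: with start flagging, a chunk can only be emitted once the first event of the *next* chunk arrives, so the caller holds the buffered parts slightly longer than it would with end flagging.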

Because the "read a (partial) chunk" state machine was much simpler than the "read one or more complete chunks + a trailing partial". The current parser structure is stupid-simple and I don't like writing code when I can skip it :-).

That's interesting, I'd have expected that the second is basically just "keep doing the first until we get a partial match". Anyway, not important. ;)


njsmith commented Nov 1, 2016

Actually it looks like implementing flags for chunk start, or chunk end, or both, is all pretty simple -- see #20 for a non-working proof of concept. (I'll leave you to turn it into a real patch? :-))


njsmith commented Nov 7, 2016

Fixed by #20.
