Easy way to retrieve just the first X bytes from a URL #1392
(This is intended as a feature request, not a bug report.)

I'm actually having trouble figuring out the right way to do this with the existing streaming API.
HTTPCore reads from the network in chunks of 64kB, so you won't be able to download in chunks smaller than that. But 64kB sounds like an okay minimum amount of bytes to download before hanging up, right? Then, I think the missing piece for you might be `num_bytes_downloaded`.

Uvicorn server:

```python
BODY = b"*" * 1024 * 1024  # 1MB, i.e. 16 * 64kB (16 HTTPCore-sized chunks)

async def app(scope, receive, send):
    assert scope["type"] == "http"
    await send(
        {
            "type": "http.response.start",
            "status": 200,
            "headers": [[b"content-type", b"text/plain"]],
        }
    )
    await send({"type": "http.response.body", "body": BODY})
```

Client:

```python
import httpx

url = "http://localhost:8000"
truncate_after = 32 * 1024  # 32kB

with httpx.stream("GET", url) as response:
    body = b""
    for chunk in response.iter_bytes():
        body += chunk
        if response.num_bytes_downloaded >= truncate_after:
            break

assert len(body) == response.num_bytes_downloaded
print(len(body))  # Actually 64kB (due to HTTPCore's chunk size)
```

Can also be rewritten using `itertools.takewhile`:

```python
import itertools

with httpx.stream("GET", url) as response:
    truncated_iter_bytes = itertools.takewhile(
        lambda _: response.num_bytes_downloaded < truncate_after,
        response.iter_bytes(),
    )
    body = b"".join(truncated_iter_bytes)
```

I'm not sure this will guarantee that data won't actually be fully read from the server, though. E.g. in this case we can see that the server sends the huge body in one go, so it may try to push it through and have it accumulate in the internal socket buffers that we'd just happen to not read entirely on the client side. This is all supposition though - I'm not enough of a networking junkie to tell. :-) Point is, you may need to verify this does do what you want (limit memory usage) in case the server tries to send a huge chunk in one go. Otherwise I think that should do the trick? :-)
We've got some docs about streaming responses that may be relevant here, too.
Thanks! The 64kB thing was the bit I was missing - so it looks like I'll be safe if I open a stream and then consume just the first chunk of the iterator. I do think a useful feature for httpx would be a built-in way to cap the size of a response.
Yes, if you roughly know how much you'd need to read and that's below the 64kB limit, you could probably simplify it down to:

```python
import httpx

with httpx.stream("GET", "http://localhost:8000") as response:
    head = next(response.iter_bytes())

assert len(head) == response.num_bytes_downloaded
print(len(head))  # 65,528 bytes
```

What do you think of #1277? It adds a bit more control over the chunking behavior (at least on the user side). It would allow doing something like:

```python
with httpx.stream("GET", url) as response:
    head = next(response.iter_bytes(chunk_size=1024))  # Only the 1st kB.
```

HTTPCore would still read 64kB-sized chunks (we figured that was an optimal size for kernel-side vs Python-side processing, see encode/httpcore#135), but those chunks would be further sliced and diced on the HTTPX side, so they're exposed to users with the expected size.
Oh I hadn't seen #1277 - looks like exactly what I want. I still think it's worth considering a higher-level interface for this. I'm a big fan of HTTP libraries that make it easy to deal with potentially hostile inputs, and the key features needed for that are timeouts and a way to limit the size of the response body.
I believe both of those are already achievable with the current API, then?
No, that's exactly right - it's just the response size limit that's not obvious at the moment.
Closed via #1277 |
I'm writing code that retrieves HTML from a URL provided by an (untrusted) user. httpx has excellent support for timeouts already, but I'm also worried about people giving me a URL to a giant resource - I don't want to naively load the entire thing into process memory.

It looks like I can achieve this using `httpx.stream` - but it's not instantly obvious how to close the connection after a certain number of bytes. I'm figuring that out at the moment.

I'd love to be able to do this using the simpler `httpx.get()` interface - maybe something like a `truncate_after` option. I don't like `truncate_after` as a name, but it's the first thing that came to mind.