
Http optimizations #8002

Merged: @straight-shoota merged 4 commits into crystal-lang:master from http-optimizations on Jul 29, 2019

Conversation

@asterite (Member)

This is a series of refactors to improve parsing of HTTP requests. Check each commit for more details (each has its own description).

This is more code, more complex code, and more low-level code, but that's the whole idea of Crystal: if you want to go deeper and optimize hot paths you can do it in Crystal itself.

I wrote this benchmark:

require "benchmark"
require "http"

# # request.txt was generated using this code
# request = <<-REQUEST.lines.join("\r\n") + "\r\n\r\n"
# GET /hello.htm HTTP/1.1
# User-Agent: Mozilla/4.0 (compatible; MSIE5.01; Windows NT)
# Host: www.tutorialspoint.com
# Accept-Language: en-us
# Accept-Encoding: gzip, deflate
# Connection: Keep-Alive
# REQUEST
# File.write("request.txt", request)

io = File.open("request.txt")

Benchmark.ips do |x|
  x.report("from_io") do
    io.rewind
    HTTP::Request.from_io(io)
  end
end

I read from a file to simulate reading from an external resource, though the entire file probably fits in IO::Buffered's buffer (the same should be true for a Socket). I'll also show the results with IO::Memory below.

If I run the above benchmark against master:

$ crystal foo.cr --release
from_io 335.90k (  2.98µs) (± 1.07%)  2.1kB/op  fastest

When against this PR:

$ bin/crystal foo.cr --release
from_io 495.93k (  2.02µs) (± 0.85%)  816B/op  fastest

So about a 30% reduction in time per call (2.98µs down to 2.02µs), or roughly 48% more iterations per second! 😄

Also note the memory allocated per op: 2.1kB before, now 816B. This is the main reason it's faster.

If I use IO::Memory instead of File I get:

Before: from_io 509.34k (  1.96µs) (± 0.75%)  2.1kB/op  fastest
After:  from_io   1.01M (991.82ns) (± 0.75%)  816B/op  fastest
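For reference, the IO::Memory variant only changes how the io is created; everything else stays the same (a sketch, assuming the same request.txt generated above):

require "benchmark"
require "http"

# Load the whole request into memory instead of going through a File
io = IO::Memory.new(File.read("request.txt"))

Benchmark.ips do |x|
  x.report("from_io") do
    io.rewind
    HTTP::Request.from_io(io)
  end
end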

Since HTTP servers are pretty common in Crystal, I thought this was a good way to optimize all the apps out there. I know a typical web server usually does a lot more (rendering, for example), but the less time and memory the framework takes for itself, the better.

I didn't benchmark an HTTP::Server with ab or similar after this change, but you are welcome to do that and post the results here!

@asterite force-pushed the http-optimizations branch 4 times, most recently from 454f744 to d8e2294 on July 27, 2019
Instead of storing headers internally as `Hash(String, Array(String))`
we store them as `Hash(String, String | Array(String))`. This involves
a bit more logic when dealing with this union but it saves a fair
amount of memory because for headers with just a single value, which
is the most common case, we avoid allocating memory for an array.
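To illustrate the idea, here is a minimal sketch of that storage scheme (the alias and helper names are illustrative, not the actual HTTP::Headers internals):

alias HeaderValue = String | Array(String)

def add_header(store : Hash(String, HeaderValue), key : String, value : String)
  case existing = store[key]?
  when Nil
    store[key] = value             # common case: no Array allocated
  when String
    store[key] = [existing, value] # promote to an Array on the second value
  when Array(String)
    existing << value
  end
end

store = Hash(String, HeaderValue).new
add_header(store, "Host", "example.com")
add_header(store, "Accept", "text/html")
add_header(store, "Accept", "application/json")
p store # => {"Host" => "example.com", "Accept" => ["text/html", "application/json"]}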
@asterite (Member, Author)

Sorry about the constant rebasing; next time I'll keep adding commits to make reviewing easier, and only rebase everything at the end before merging (I learned rebase -i and now I can't stop, hehe :-P)

Instead of using `String#split`, which creates an array with three
strings, we find the space indexes and create subslices/substrings for
each of the pieces.
We also avoid allocating a string for common HTTP methods (GET, POST,
etc.) and for the supported HTTP versions.
Finally, we use `IO#peek` to see if we can find the request line there
instead of allocating a String for it.
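A rough sketch of that technique applied to the request line (illustrative only; the real code additionally matches the method and version against shared string constants to skip those allocations):

line = "GET /hello.htm HTTP/1.1"

if (first = line.index(' ')) && (second = line.index(' ', first + 1))
  method  = line[0, first]                      # "GET"
  path    = line[first + 1, second - first - 1] # "/hello.htm"
  version = line[(second + 1)..]                # "HTTP/1.1"
  p({method, path, version})
end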
@straight-shoota (Member)

git commit --fixup is great for this. It annotates a commit as a fixup of a previous one, and with git rebase -i --autosquash the fixups are automatically inserted at the right place.
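For example (the sha placeholder is whatever commit the fixup targets, and this assumes the branch was forked from master):

git commit --fixup=<sha-of-the-commit-to-amend>
git rebase -i --autosquash master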

@asterite (Member, Author)

Ooooh... I didn't know that. I'll try it next time. Thanks!

When creating an `HTTP::Request` and passing it some `HTTP::Headers`,
the headers are dupped to prevent the request from modifying data that
the user might hold. However, dupping the headers when parsing a
request from an IO is not necessary. This avoids some unneeded memory
allocations.
We try to use `IO#peek` and read header lines directly from there,
avoiding an extra String allocation for each header line. Then we
avoid allocating strings for common header field names like `Host`
and `Content-Length`.
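Roughly, the peek technique looks like this (a simplified sketch; the real implementation also handles the case where the line isn't fully buffered and falls back to reading a String):

io = IO::Memory.new("Host: example.com\r\nContent-Length: 0\r\n\r\n")

if (peek = io.peek) && (newline = peek.index('\n'.ord.to_u8))
  line = peek[0, newline + 1] # the header line, still inside the buffer
  # ... parse the field name/value directly from this slice,
  # matching common names like "Host" against string constants ...
  io.skip(newline + 1)        # consume only what was parsed
end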
@straight-shoota added this to the 0.30.0 milestone on Jul 29, 2019
@straight-shoota merged commit fe7e663 into crystal-lang:master on Jul 29, 2019
@straight-shoota (Member)

Thank you @asterite

@RX14 (Contributor) commented Jul 29, 2019

I was still reviewing this and had changes to request :<

@straight-shoota (Member)

Oh, I'm sorry 😢 I should've just merged it yesterday right away^^

Just a suggestion: when I'm reviewing a PR that might get merged in the meantime, I tend to request a review from myself to signal that I'm currently looking at it (or intend to do so).

@asterite (Member, Author)

By the way, I benchmarked the simple HTTP server in the samples directory before and after this change with ab. It's just a silly example, but it can show whether this change has an effect on the overall HTTP round trip.

Doing:

ab -k -c 100 -n 200000 127.0.0.1:8080/

Before:

Time taken for tests:   1.918 seconds
Requests per second:    104299.05 [#/sec] (mean)
Time per request:       0.959 [ms] (mean)
Time per request:       0.010 [ms] (mean, across all concurrent requests)
Transfer rate:          10287.31 [Kbytes/sec] received

After:

Time taken for tests:   1.716 seconds
Requests per second:    116551.95 [#/sec] (mean)
Time per request:       0.858 [ms] (mean)
Time per request:       0.009 [ms] (mean, across all concurrent requests)
Transfer rate:          11495.85 [Kbytes/sec] received

And I'm almost sure that if we change Hash to an open addressing implementation it could go up to 121750.26 requests per second (just a number off the top of my head 😉).

By comparison, doing the same benchmark against a simple server in Go gives these results:

Time taken for tests:   2.767 seconds
Requests per second:    72285.75 [#/sec] (mean)
Time per request:       1.383 [ms] (mean)
Time per request:       0.014 [ms] (mean, across all concurrent requests)
Transfer rate:          10800.51 [Kbytes/sec] received

However, Go handles parallelism. Once we have parallelism too, performance will probably get a bit worse, but on the other hand a single request doing expensive CPU work won't be able to stop the server from receiving other requests.

@RX14 (Contributor) commented Jul 29, 2019

@asterite could you test with wrk instead in the future? It gives much better and more realistic results, and it's the industry-standard HTTP benchmarking tool these days.

@asterite (Member, Author)

@RX14 Sure! Here it is:

wrk -t12 -c400 -d30s http://127.0.0.1:8080/

(I don't know if those values are good, I just copied them from their GitHub repo.)

Before this PR:

Running 30s test @ http://127.0.0.1:8080/
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.51ms  750.62us  14.12ms   80.25%
    Req/Sec     9.31k   687.19    26.23k    96.78%
  3339715 requests in 30.10s, 321.69MB read
  Socket errors: connect 0, read 244, write 0, timeout 0
Requests/sec: 110942.98
Transfer/sec:     10.69MB

After this PR:

Running 30s test @ http://127.0.0.1:8080/
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.26ms  603.93us   9.90ms   82.14%
    Req/Sec     9.92k   594.01    14.69k    90.38%
  3562875 requests in 30.10s, 343.18MB read
  Socket errors: connect 0, read 239, write 0, timeout 0
Requests/sec: 118355.73
Transfer/sec:     11.40MB

With a "hypothetical" Hash with open addressing:

Running 30s test @ http://127.0.0.1:8080/
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.13ms  446.90us  12.34ms   81.88%
    Req/Sec    10.33k   434.48    12.97k    83.14%
  3699575 requests in 30.00s, 356.35MB read
  Socket errors: connect 0, read 237, write 0, timeout 0
Requests/sec: 123302.09
Transfer/sec:     11.88MB

Go:

Running 30s test @ http://127.0.0.1:8080/
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     4.78ms    7.57ms 336.03ms   94.45%
    Req/Sec     8.50k     1.07k   45.00k    96.03%
  3046719 requests in 30.10s, 374.82MB read
  Socket errors: connect 0, read 250, write 0, timeout 0
Requests/sec: 101219.02
Transfer/sec:     12.45MB

So the results are quite different from ab's, but we can see each optimization is still noticeable. Go does pretty well too.

@RX14 (Contributor) commented Jul 29, 2019

@asterite is that Go with GOMAXPROCS=1, or is Go actually using all cores and still losing?

@RX14 (Contributor) commented Jul 29, 2019

Also, are you going to address the review?

@asterite (Member, Author)

> @asterite is that Go with GOMAXPROCS=1, or is Go actually using all cores and still losing?

No, that's without specifying GOMAXPROCS (so I guess it's using all cores).

If I pass GOMAXPROCS=1 I get:

Running 30s test @ http://127.0.0.1:8080/
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    18.14ms   88.69ms   2.00s    98.73%
    Req/Sec     3.14k   339.80     8.62k    85.96%
  1127120 requests in 30.10s, 138.66MB read
  Socket errors: connect 0, read 397, write 0, timeout 118
Requests/sec:  37442.65
Transfer/sec:      4.61MB

So I guess if, with parallelism, we can stay close to Go's numbers, it'll be more than good (/cc @waj @bcardiff)

> Also, are you going to address the review?

Yes, but in a couple of hours.
