Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add gateway histogram metrics #8443

Merged
merged 4 commits into from
Mar 21, 2022
Merged

Conversation

aschmahmann
Copy link
Contributor

fixes #8441

@gmasgras did you want both the summary and histogram exposed, or should I just switch the summary over to a histogram?

Any preference on the metric name? For example, if replacing the summary metric we may still want a different name so infra that's running the older version and the newer version can have the metrics play together a little nicer.

@gmasgras
Copy link

No real preference but something like ipfs_http_unixfs_get_latency_seconds_buckets would be in line with how most metrics exposing counters and histograms are named

@aschmahmann
Copy link
Contributor Author

Do you want both ipfs_http_unixfs_get_latency_seconds for the summary and ipfs_http_unixfs_get_latency_seconds_buckets for the histogram?

@gmasgras
Copy link

ipfs_http_unixfs_get_latency_seconds_buckets should be enough, as it allows us to calculate the summaries exported in ipfs_http_unixfs_get_latency_seconds

@BigLep
Copy link
Contributor

BigLep commented Mar 3, 2022

@schomatis : can you please review/merge?

@BigLep
Copy link
Contributor

BigLep commented Mar 3, 2022

@aschmahmann : I'm moving this to "ready for review"

@BigLep BigLep marked this pull request as ready for review March 3, 2022 06:59
@BigLep BigLep added this to the Best Effort Track milestone Mar 3, 2022
schomatis
schomatis previously approved these changes Mar 9, 2022
Copy link
Contributor

@schomatis schomatis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense but I know nothing of our metric system.

@BigLep
Copy link
Contributor

BigLep commented Mar 10, 2022

@aschmahmann : any concern with merging?

@BigLep BigLep requested a review from lidel March 11, 2022 16:24
@BigLep BigLep assigned lidel and unassigned aschmahmann Mar 11, 2022
@BigLep BigLep requested review from lidel and removed request for lidel March 11, 2022 16:25
@lidel
Copy link
Member

lidel commented Mar 12, 2022

I will update this PR to work with non-unixfs response types added in #8758

@schomatis schomatis dismissed their stale review March 14, 2022 13:11

Testing if this removes me from the review list and removes the PR from my view in the board (I think it doesn't but want to confirm).

@BigLep BigLep requested review from schomatis and removed request for schomatis March 15, 2022 15:03
@lidel lidel force-pushed the feat/gateway-ttfb-histogram branch from 0176b66 to 914ed40 Compare March 16, 2022 17:23
@lidel lidel changed the base branch from master to feat/fetch-as-block-or-car March 16, 2022 17:23
@lidel lidel changed the title feat: add gateway unixfs time to first block histogram feat: add gateway histograms Mar 16, 2022
@lidel
Copy link
Member

lidel commented Mar 16, 2022

I rebased this PR on top of feat/gateway-ttfb-histogram (#8758) and introduced a bunch of histograms (they have _sum and _count so we don't need any additional summaries):

New Histogram Metrics

  • generic

    • time to first content block histogram (gw_first_content_block_get_latency_seconds) is response-type-agnostic and aims to supersede the old deprecated summary at unixfs_get_latency_seconds
  • per response type duration histograms that give us visibility into how popular and expensive each response type is:

    • unixfs file (gw_unixfs_file_get_duration_seconds)
    • unixfs dir listing that is HTML generated by go-ipfs on-the-fly (gw_unixfs_gen_dir_listing_get_duration_seconds)
    • CAR stream (gw_car_stream_get_duration_seconds)
    • single raw Block (gw_raw_block_get_duration_seconds)

👉 as per usual, feedback on names / descriptions / usefullness is appreciated.

Demo

http://127.0.0.1:5001/debug/metrics/prometheus will have histogram per namespace
(they appear the moment we had the first event, so you need to load file/dir/car/block for them to appear):

# HELP ipfs_http_gw_first_content_block_get_latency_seconds The time till the first content block is received on GET from the gateway.
# TYPE ipfs_http_gw_first_content_block_get_latency_seconds histogram

ipfs_http_gw_first_content_block_get_latency_seconds_bucket{gateway="ipfs",le="0.1"} 3
ipfs_http_gw_first_content_block_get_latency_seconds_bucket{gateway="ipfs",le="0.5"} 3
ipfs_http_gw_first_content_block_get_latency_seconds_bucket{gateway="ipfs",le="1"} 3
ipfs_http_gw_first_content_block_get_latency_seconds_bucket{gateway="ipfs",le="2"} 3
ipfs_http_gw_first_content_block_get_latency_seconds_bucket{gateway="ipfs",le="3"} 3
ipfs_http_gw_first_content_block_get_latency_seconds_bucket{gateway="ipfs",le="5"} 3
ipfs_http_gw_first_content_block_get_latency_seconds_bucket{gateway="ipfs",le="8"} 3
ipfs_http_gw_first_content_block_get_latency_seconds_bucket{gateway="ipfs",le="13"} 3
ipfs_http_gw_first_content_block_get_latency_seconds_bucket{gateway="ipfs",le="+Inf"} 3
ipfs_http_gw_first_content_block_get_latency_seconds_sum{gateway="ipfs"} 0.024477469
ipfs_http_gw_first_content_block_get_latency_seconds_count{gateway="ipfs"} 3
...
ipfs_http_gw_first_content_block_get_latency_seconds_bucket{gateway="ipns",le="0.1"} 60
ipfs_http_gw_first_content_block_get_latency_seconds_bucket{gateway="ipns",le="0.5"} 63
ipfs_http_gw_first_content_block_get_latency_seconds_bucket{gateway="ipns",le="1"} 63
ipfs_http_gw_first_content_block_get_latency_seconds_bucket{gateway="ipns",le="2"} 63
ipfs_http_gw_first_content_block_get_latency_seconds_bucket{gateway="ipns",le="3"} 63
ipfs_http_gw_first_content_block_get_latency_seconds_bucket{gateway="ipns",le="5"} 63
ipfs_http_gw_first_content_block_get_latency_seconds_bucket{gateway="ipns",le="8"} 63
ipfs_http_gw_first_content_block_get_latency_seconds_bucket{gateway="ipns",le="13"} 63
ipfs_http_gw_first_content_block_get_latency_seconds_bucket{gateway="ipns",le="+Inf"} 63
ipfs_http_gw_first_content_block_get_latency_seconds_sum{gateway="ipns"} 0.9177464280000002
ipfs_http_gw_first_content_block_get_latency_seconds_count{gateway="ipns"} 63
...
# HELP ipfs_http_gw_unixfs_gen_dir_listing_get_duration_seconds The time it takes till generated HTML with directory listing is served on GET from the gateway.
# TYPE ipfs_http_gw_unixfs_gen_dir_listing_get_duration_seconds histogram
ipfs_http_gw_unixfs_gen_dir_listing_get_duration_seconds_bucket{gateway="ipfs",le="0.1"} 1
ipfs_http_gw_unixfs_gen_dir_listing_get_duration_seconds_bucket{gateway="ipfs",le="0.5"} 1
ipfs_http_gw_unixfs_gen_dir_listing_get_duration_seconds_bucket{gateway="ipfs",le="1"} 1
ipfs_http_gw_unixfs_gen_dir_listing_get_duration_seconds_bucket{gateway="ipfs",le="2"} 1
ipfs_http_gw_unixfs_gen_dir_listing_get_duration_seconds_bucket{gateway="ipfs",le="3"} 1
ipfs_http_gw_unixfs_gen_dir_listing_get_duration_seconds_bucket{gateway="ipfs",le="5"} 1
ipfs_http_gw_unixfs_gen_dir_listing_get_duration_seconds_bucket{gateway="ipfs",le="8"} 1
ipfs_http_gw_unixfs_gen_dir_listing_get_duration_seconds_bucket{gateway="ipfs",le="13"} 1
ipfs_http_gw_unixfs_gen_dir_listing_get_duration_seconds_bucket{gateway="ipfs",le="+Inf"} 2
ipfs_http_gw_unixfs_gen_dir_listing_get_duration_seconds_sum{gateway="ipfs"} 19.474375763
ipfs_http_gw_unixfs_gen_dir_listing_get_duration_seconds_count{gateway="ipfs"} 2

@BigLep
Copy link
Contributor

BigLep commented Mar 16, 2022

Awesome @lidel ! Is review needed here now?

@lidel
Copy link
Member

lidel commented Mar 16, 2022

Yes, especially from our gateway/infra team (@gmasgras created original ask in #8441).
I will also ping our ipfs-operators community.

Review prompts:

  • are defaultBuckets good enough for all histograms?
  • are proposed metrics useful or overkill?
  • should we add / remove anything?

core/corehttp/gateway_handler.go Outdated Show resolved Hide resolved
core/corehttp/gateway_handler.go Outdated Show resolved Hide resolved
@lidel lidel changed the title feat: add gateway histograms feat: add gateway histogram metrics Mar 16, 2022
@lidel lidel requested review from Jorropo and guseggert March 16, 2022 17:45
@lidel lidel added the need/community-input Needs input from the wider community label Mar 16, 2022
Base automatically changed from feat/fetch-as-block-or-car to master March 17, 2022 16:15
@gmasgras
Copy link

Yes, especially from our gateway/infra team (@gmasgras created original ask in #8441). I will also ping our ipfs-operators community.

Review prompts:

* are defaultBuckets good enough for all histograms?

* are proposed metrics useful or overkill?

* should we add / remove anything?

The new metrics look good, only suggestions is that we add a couple more buckets.

core/corehttp/gateway_handler.go Outdated Show resolved Hide resolved
core/corehttp/gateway_handler.go Outdated Show resolved Hide resolved
core/corehttp/gateway_handler.go Outdated Show resolved Hide resolved
core/corehttp/gateway_handler.go Outdated Show resolved Hide resolved
core/corehttp/gateway_handler.go Outdated Show resolved Hide resolved
core/corehttp/gateway_handler.go Outdated Show resolved Hide resolved

// Legacy Metrics
// ----------------------------
unixfsGetMetric: newGatewaySummaryMetric( // TODO: remove?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's low-effort to keep right? Might as well keep it around to avoid breaking the monitoring of gateway operators.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, my plan is to keep it at least for one/two releases.

aschmahmann and others added 3 commits March 19, 2022 00:28
Adds:

- response-type agnostic firstContentBlockGetMetric which counts the
  latency til the first content block.

- car/block/file/gen-dir-index duration histogram metrics that show how
  long each response type takes
Co-authored-by: Gus Eggert <gus@gus.dev>
@lidel lidel force-pushed the feat/gateway-ttfb-histogram branch from 42fac3b to ae5de99 Compare March 18, 2022 23:31
@lidel
Copy link
Member

lidel commented Mar 18, 2022

  • We had some wacky conflicts with master. Rebased and resolved conflicts, let's see if CI pass.
  • Remaining work is to update buckets based on feedback – will do this next week.

0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 30, 60
as suggested in reviews:
#8443
@lidel
Copy link
Member

lidel commented Mar 21, 2022

I've changed buckets to include asks/suggestions from the reviews, we now have 10 of them:
50ms, 100ms, 250ms, 500ms, 1s, 2s, 5s, 10s, 30s, 60s

@lidel lidel removed the need/community-input Needs input from the wider community label Mar 21, 2022
@lidel lidel merged commit beaa8fc into master Mar 21, 2022
@lidel lidel deleted the feat/gateway-ttfb-histogram branch March 21, 2022 14:57
hacdias pushed a commit to ipfs/boxo that referenced this pull request Jan 27, 2023
* feat(gw): response type histogram metrics

- response-type agnostic firstContentBlockGetMetric which counts the
  latency til the first content block.

- car/block/file/gen-dir-index duration histogram metrics that show how
  long each response type takes

* docs: improve metrics descriptions
* feat: more gw histogram buckets

0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 30, 60 secs
as suggested in reviews at ipfs/kubo#8443

Co-authored-by: Marcin Rataj <lidel@lidel.org>
Co-authored-by: Gus Eggert <gus@gus.dev>

This commit was moved from ipfs/kubo@beaa8fc
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Improved Gateway HTTP request metrics
6 participants