perf: fix for cluster aggregation performance #631

ssg2526 · 2024-05-20T19:06:59Z

The changes include performance improvements for cluster mode aggregations flow. The changes are 3 fold -

The hashing of each metric is moved to workers. Now each worker hashes the data and adds that in a key named hash in the metrics json.
- This is distributing the hashing task among workers rather than master doing the CPU task
The Map which was being built on every request freshly, is made global and only if a new metric comes then that will be put in the map else we will get the value and push the item in the array.
- The insertion of large keys is very slow in Map as it is maintaining insertion order and this will minimize the insertions
A major choking point was when master asks workers to provide metrics and workers sending those metrics over IPC causing the choking in IPC which was hindering the request routing from master to workers. The change here is that workers don't send metrics over IPC rather they are writing them in a file and sending master the file name over IPC. Master is then reading the metrics from the files and deleting the files.
- This is removing the congestion on IPC and requests routing from master to workers not getting hampered.

ssg2526 · 2024-05-30T18:46:01Z

@SimenB @siimon kindly approve for workflows. I have added the prettier fixes because of which the earlier workflows failed.

zbjornson

Thanks for working on this!

Do you have any benchmarks you can show? (both the code and the results)

zbjornson · 2024-05-31T22:22:33Z

lib/cluster.js

+						// adding request id in file path to handle concurrency
+						const filename = path.join(
+							os.tmpdir(),
+							`metrics-${process.pid}-${message.requestId}.json`,
+						);
+						fs.writeFile(filename, JSON.stringify(metrics), err => {


I'm really hesitant to involve files, since that suddenly involves the page cache, file system, more syscalls and possibly other surprises like filesystem permissions, running out of disk space and cleaning up. How much of an improvement did this yield? Did you benchmark before or after Node.js v18.6.0, which has this optimization? How much disk I/O load did your system have when benchmarking? I think IPC is "supposed" to be faster than involving the filesystem.

The benchmarks that I have performed are on our own application. The screenshots of the results are shared in the
this Issue Link. The benchmarks were done on a 16 core machine and approximately 250-300 metrics (total size ~500 Kbs). At P99 and above levels we are getting ~10x improvement in app performance. The only disk I/Os we have is logging but that is also not happening on /tmp mounted drive on which we are writing our metrics.
The major choking point was with IPC and creating the map and hashing the object. The detailed bifurcation that I have done on my local machine which is 8 core machine is given In the screenshot below. and the code used is in below given zip file. The zip file contains the node modules as well because I have added some logs in prom-client to get the time data of each parts. Here in screenshot if we see the worst IPC time is always higher. I had run multiple iterations and same/similar results are found. The total aggregation time which is 72ms in this screenshot out of which 68ms is taken for hashing and building the map. I haven't tested with node v18.6.0 yet. will test the same code with that as well and share the results in the same thread
cluster_test.zip

In these benchmarks (issue link) the scraping interval used was every 5 seconds and the throughput on the app was about 1100-1200 RPS.

Hi @zbjornson

I also tried the code in the zip file with node v18.6.0. I am getting the similar results without any improvements in the logged timings.

I think we shouldn't use IPC for this communication because that can create a problem whenever the sizes of metrics go beyond a certain point and that will start blocking the requests routing at high throughputs.

We can ask user to provide permissions for /tmp folder. From prom-client performance POV i didn't see much of a difference with files but again I am only hitting it once every 5 seconds.

If not files then we should think of other ways of communication to solve this completely like workers calling over tcp on master port

Hi @zbjornson @SimenB we went live in production with these changes about 3 months ago and everything is working fine for us. We are not seeing any additional latencies post this change. We serve more than 100,000 RPS at peak for this service. If required I can share the production results as well.
I understand that introducing file system can pose different challenges for the library, But we can explore communication with HTTP or some other mode of communication (Not IPC) when we have larger size of the metrics as the current solution will cause very high tail latencies (>P99) at high throughput and larger metrics sizes.
Requesting your inputs on this, In terms of how we can take this forward as we don't want to diverge from the library maintaining our own custom version.

fix for cluster aggregation performance

473f29c

ssg2526 mentioned this pull request May 20, 2024

Prom-Client aggregation causing master to choke in cluster mode #628

Open

Saurabh Singh added 2 commits May 23, 2024 14:44

added change log and handling concurrency

cbe29f3

linting and prettier fixes

7879114

SimenB requested a review from zbjornson May 31, 2024 06:06

making values array empty in each scrape

49da0d9

zbjornson reviewed May 31, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: fix for cluster aggregation performance #631

perf: fix for cluster aggregation performance #631

ssg2526 commented May 20, 2024

ssg2526 commented May 30, 2024

zbjornson left a comment

zbjornson May 31, 2024

ssg2526 Jun 3, 2024 •

edited

Loading

ssg2526 Jun 3, 2024

ssg2526 Jun 6, 2024 •

edited

Loading

ssg2526 Sep 3, 2024 •

edited

Loading

perf: fix for cluster aggregation performance #631

Are you sure you want to change the base?

perf: fix for cluster aggregation performance #631

Conversation

ssg2526 commented May 20, 2024

ssg2526 commented May 30, 2024

zbjornson left a comment

Choose a reason for hiding this comment

zbjornson May 31, 2024

Choose a reason for hiding this comment

ssg2526 Jun 3, 2024 • edited Loading

Choose a reason for hiding this comment

ssg2526 Jun 3, 2024

Choose a reason for hiding this comment

ssg2526 Jun 6, 2024 • edited Loading

Choose a reason for hiding this comment

ssg2526 Sep 3, 2024 • edited Loading

Choose a reason for hiding this comment

ssg2526 Jun 3, 2024 •

edited

Loading

ssg2526 Jun 6, 2024 •

edited

Loading

ssg2526 Sep 3, 2024 •

edited

Loading