Run DataFusion benchmarks regularly and track performance history over time #5504

alamb · 2023-03-07T15:16:23Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
As we make changes to DataFusion, some changes impact performance and some do not. Right now we mostly rely on reviewers to judge when a change could make an impact on performance, and if so run the appropriate benchmarks.

This means that

We may miss some performance regressions (such as Optimize Accumulator size function performance (fix regression on clickbench) #5325)
Since the benchmarks are not run regularly it is hard to know how to interpret results, and some seem to have bitrotted over time
The wide variety of available benchmarks (e.g. Review existing datafusion benchmarks and clean them up #5502) makes it hard to know which ones to run and how to determine if performance has improved or regressed for particular changes

Describe the solution you'd like
I would like

A system that runs DataFusion benchmarks regularly on main
Some automated way to see if a particular PR has improved or regressed performance
Bonus: a webpage that shows performance over time. Databend has a great example https://perf.databend.rs/

Suggestion

I believe conbench, https://conbench.ursa.dev/, which is partially integrated into the repo already, is intended for exactly this usecase. Using conbench would be nice as it appears to be actively maintained and has resources and is already hosted

The integration is https://github.com/apache/arrow-datafusion/tree/main/conbench and was added in #1791 by @dianaclarke

You can see its integration as it posts comments on PRs after merge such as #5476 (comment)

Describe alternatives you've considered
We could use existing timeseries databases and visualizations like grafana to visualize the information

Additional context

The text was updated successfully, but these errors were encountered:

andygrove · 2023-03-07T15:36:32Z

I had started working on this, not just for DataFusion, but for other OSS query engines as well.

See https://sqlbenchmarks.io/sqlbench-h/results/env/workstation/sf10/single_node/ for an example.

I was planning on automating this to run nightly, or on each merge to master, but have not found the time to do this yet.

ozankabak · 2023-03-09T04:53:13Z

I was planning on automating this to run nightly, or on each merge to master, but have not found the time to do this yet.

This would be fantastic!

epompeii · 2023-04-17T13:05:11Z

How are you considering running the benchmarks in CI @andygrove ?
I've been working on a tool similar to conbench called Bencher (https://github.com/bencherdev/bencher), and I've been trying to explore the best ways to do continuous benchmarking.

alamb · 2023-04-21T19:42:23Z

How are you considering running the benchmarks in CI @andygrove ?

Using CI in general would be ideal.

Using the github runners is probably not a great idea as they are quite variable (they are on shared machines and there isn't any visibility into what else is going on on those machines / VMs)

alamb · 2023-04-24T11:14:39Z

I was thinking about this the other day -- and it seems to me this might be a perfect usecase for a timeseries database (which full disclosure we have at InfluxData). I was thinking that one way to record the history over time is use a widely supported format like LineProtocol -- https://docs.influxdata.com/influxdb/cloud-iox/reference/syntax/line-protocol/

Then we can visualize and display that data over time using existing tools like grafana. Also the "alert if things get slower" sounds a lot like the kinds of alerts used in timeseries databases.

I'll try and whip up some way to convert benchmark results into line protocol over the next few weeks

alamb · 2023-04-24T11:24:12Z

My high level plan is something like:

Export benchmark information as line protocol #6107
Write a script to run the benchmark and generate line protocol with a specific git repo / sha
write up instructions on how to visualize that data (maybe locally with grafana)

epompeii · 2023-04-25T13:36:32Z

@alamb this is effectively what I have built with Bencher. It tracks benchmarks over time, allows you to visualize the results (and share them as an auto-updating README image), and uses statistical thresholds to generate alerts, in the case of performance changes. There's a CLI and a GitHub Action to make it easier to run both locally and in CI.

Using CI in general would be ideal.

Definitely! As for compute, I've heard folks mention https://buildjet.com as an option. In the long run, I'm looking to add AWS Bare Metal runners to Bencher to help with this end of things. So please let me know if you are interested in going that direction.

alamb · 2023-09-15T13:43:21Z

So I can find the compute resources to run this, and I think we can use existing timeseries databases to track performance over time (e.g. IOx or grafana). The only missing piece is getting the data into a format that can be loaded into these systems

I think we could get a simple version of this feature going like:

Invoke bench.sh for the benchmarks of interest
Use Export benchmark information as line protocol #6107 to convert the existing bench output to lineprotocol (I wrote up more details of this transformation)
Check in the line protocol somewhere
Upload lineprotocol to influxdb or some other system that supports it to make some simple visualizations

alamb · 2023-09-25T10:09:50Z

It seems as if conbench is now more actively maintained that it once was and claims to have rust support: https://github.com/conbench/conbench

Perhaps it is worth another look

Smurphy000 · 2023-10-14T19:02:16Z

I am interested in spending some time with helping out on benchmarking. Currently checking out conbench, but if another solution is needed I can look into that as well.

I'm interested to know how many permutations we need to track continually (eg. machine size, simd or not, etc).

Looking for any additional thoughts / guidance besides what I have read so far in this issue.

alamb · 2023-10-15T09:55:17Z

Hi @Smurphy000 that would be amazing 🙏 . This is one of the issues I think is critical to the long term success of DataFusion but has been hard to attract attention for.

The key, in my mind, is to minimize the complexity and infrastructure requirements of this solution, as DataFusion doesn't (yet) have the kind of resources to keep a custom system operating.

Step 1: transform benchmark data for graphing

I first recommend checking out #6107 and seeing if you can write a python script / rust program that takes the json output of a benchmark run and makes a single line for each query run with the relevant parameters. That issue has example data and desired output. You might also have to extend the rust benchmark runner.

In terms of implementation, I suggest starting with one setup only (InfluxData can supply a machine / VM initially -- likely a 8core, 32GB of ram machine) and then we can expand the tested combinations as our needs do as well

Step 2: script to gather data for each commit

So in my mind, the ideal solution looks like:

A runner script that runs the benchmarks, and appends the results to some sort of text file (ideally Export benchmark information as line protocol #6107) that we can check in and that is easy to visualize
Written in one of the existing languages used in the DataFusion repo: python, bash, or rust

If you fancy a bit of bash scripting, maybe you could potentially start with bench.sh and extend it to check out the desired SHAs (I could do do this part too, if you were able to do #6107)

Here is one way a testing session might look

# setup, fetch all needed data files
./benchmarks/bench.sh data 

# Run tpch benchmarh on commit 4819e7a, 
# leave results in ./benchmarks/results/4819e7a
# Would add appropriate lines to  ./benchmarks/results/history.lp
./benchmarks/bench.sh run tpch --commit 4819e7a

Then we could write up instructions on how to visualize the data in history.lp (I would probably use influxdb/grafana as that is what I know)

Does that make sense?

epompeii · 2023-10-15T13:13:49Z

@alamb the port of InfluxDB over to Rust is super cool. Congrats!
I'm considering using it long term, if/when Bencher needs a supplemental backend for results storage.

Before @Smurphy000 goes reinventing the wheel here though, I just wanted to point out that Bencher handles all of this out of the box. All you would need to do is add a flag to your existing runner to output JSON in the expected format, and you're set: https://bencher.dev/docs/explanation/adapters#-json
Bencher would then handle Step 1 (example of continuous benchmark dogfooding with Bencher)
And the Bencher CLI makes it trivial to handle Step 2: bencher run "./benchmarks/bench.sh data"
Then you could also detect performance regressions in CI and even have it comment on PRs: https://bencher.dev/docs/explanation/thresholds/

As far as the orchestration of jobs from GitHub -> dedicated VM, this is something I am actively exploring making easier to do. I'm looking at creating an extension to Bencher for creating GitHub Actions Self-Hosted runners: https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/about-self-hosted-runners
I'm also open to other ideas if you all have any!

alamb · 2023-10-16T11:49:25Z

@epompeii -- I would love it if someone could explore the possibility -- I haven't had the time to look into whatever bencher supports. There are many different continious benchmarking frameworks / services and the reason I think people keep reinventing the wheel is that the mental effort to integrate with the frameworks (e.g. understanding the bencher json format , understanding what a slug is, or how to map important metadata like "num cpus" to this model) is often larger than creating something custom (or it least appears so from the outside)

In my opinion this is why we don't use conbench -- no one has invested enough time to meld that model with the datafusion numbers

I agree it is infuriatingly repetitive. However unless you (or someone else) directly integrates DataFusion with one of these frameworks in a way that works, I am going to expend my effort explaining what is easiest for me (aka tools I already know), unfortunately.

Smurphy000 · 2023-10-16T12:29:37Z

Currently I am looking into the few options we currently have (conbench, bencher, custom) and I can update here with my findings so some agreement on the best option can be made. I am starting with @alamb idea selfishly to learn a few new things, but I definitely want to take a look into existing frameworks!

epompeii · 2023-10-16T13:49:40Z

@alamb that totally makes sense. I'm the maintainer of Bencher, so if you or @Smurphy000 would like a walk through/to hop on a call, that may be a pretty quick way to map concepts. Long term, I'm hoping that the docs get good enough to make this mapping super easy and intuitive. That's still a work in progress though 😃

To answer your specific example about "num cpus" though, that would be metadata about the Testbed (https://bencher.dev/docs/explanation/benchmarking/#testbed). Currently the best way to have that available in Bencher itself is to include it in the name. You can compare benchmarks results by Testbed. (ex "Debian x86 4 CPU 8GB" vs "Debian x86 8 CPU 16GB")

And definitely explore all your options @Smurphy000 ! If you haven't seen it already, I've compiled a pretty comprehensive list of prior art in the space, that may be helpful here: https://bencher.dev/docs/reference/prior-art/
Let me know if you find anything that you think I should add!

alamb · 2024-02-28T13:49:02Z

@gruuya and I were talking about this project recently and he said he may have some ideas on how to proceed / push it forward.

gruuya · 2024-03-01T20:23:18Z

Ok, I think I'm up to speed with the history of this issue now.

To me it seems that there are 2 separate action items discussed here:

CI benchmark that will comment on PRs once merged to main, and possibly on open PRs after invoking some hook (points 1 and 2 in the issue description). This will be just a summary of the perf comparison between the base commit and the current/merge commit. This might be done using github shared runners, as presumably the variance wouldn't be too wild during the same run.
As an extension of 1. above, a way to persist this info, and ideally visualize it over time. This would involve a bit more effort, as it would require setting up (and maintaining) dedicated infra to control for the variance. I guess the idea was to be checking in the benchmark numbers into some(this?) github repo, as opposed to a cloud DB somewhere?

Also looks like both conbench and bencher are eligible to handle the above.

IMO it's best to try and tackle 1. for now, as this would present a big win, and would pave the way to 2 as well.

epompeii · 2024-03-02T17:34:33Z

@gruuya I would be more than happy to help with the integration of Bencher, which could actually help with both 1. and 2.
I set up a similar approach to 1. for Diesel recently: diesel-rs/diesel#3849
It used relative benchmarking, which looks something like this: https://bencher.dev/docs/how-to/track-benchmarks/#relative-benchmarking
So you only compare the base commit and the current/merge commit on the same runner, like you want.

For 2. we would just benchmark on pushes to main and you would be able to visualize them with a Bencher Perf Page, this in an example for the Rustls project.

alamb · 2024-03-03T08:40:51Z

IMO it's best to try and tackle 1. for now, as this would present a big win, and would pave the way to 2 as well.

I agree this is a good idea / approach.

Note I tried to figure out conbench in the past and I didn't get very far -- though I didn't spend a huge amount of focused time on it.

gruuya · 2024-03-07T18:27:30Z

Ok, so I went ahead and took a stab at 1. #9461

Note that there are some issues with properly testing this out, as noted in the description (though i could be missing something GitHub action trick), and so the effort would require a follow up PR.

The idea is to just run the usual benchmarks that devs do locally, and then post a comment with the results to the PR, so nothing fancy.

I can also post some ideas I had about potential follow-up work later on if this makes sense.

alamb · 2024-03-08T20:33:50Z

Thanks @gruuya -- I'll check out #9461 shortly. I am back catching up with alerts

korowa · 2024-03-26T19:10:06Z

Just some thoughts based on #9800 -- at this moment benchmark results may have significant fluctuations (which is expected), but on the other side, an example from #9461 (splitgraph#1) shows that is fine to use them for tracking some significant performance regressions.

Maybe we should increase 5% "no change" threshold for these benchmarks? It won't be able to show majority of improvements, but it still will be able to highlight if changes are literally breaking in terms of performance.

UPD: another options (or additions to modifying threshold) might be increasing the number of iterations / sleeping between them

gruuya · 2024-03-26T22:16:01Z

Hey @korowa, thanks for your observations! Indeed my own experience suggests a slight bias against the PR results for some reason: typically a couple of queries reported with 1.05-1.3 slow-up (even if nothing is changed), so I'd ignore anything in that range atm (not sure it's worth increasing the 5% threshold though).

That said, I think the current setup is sufficient to catch larger regressions for now. I also don't think increasing the number of iterations / sleeping between them on it's own would be good enough, since we'd trade that against the increased longitudinal performance variance component of the shared GitHub runners, but I guess it might be worth a try.

The more long term solution, and the first next step that makes sense to me would be to run these on a dedicated runner.

Following that I see the list of improvements as:

(optional) simplify CI benchmark development: Simplify CI benchmark comparison development #9638
add more benchmarks (a selection of Clickbench + TPC-DS queries)
track benchmarks over time, through a similar job triggered by merge commits to main (fwiw, I now prefer Bencher to conbench, as it seems simpler to setup/maintain)
(optional) re-base benchmarks on criterion.rs; this provides a neat standardized way of running/analyzing/collecting these stats in as-of-recently a bit user-friendlier manner: https://www.tweag.io/blog/2022-03-03-criterion-rs/

alamb · 2024-03-27T12:03:01Z

The more long term solution, and the first next step that makes sense to me would be to run these on a dedicated runner.

I agree with this. How can we make it happen? If we found a monetary sponsor to host the runner, would you be willing to set it up?

Another potential thing to do is to increase the data size / time required per query. As @korowa notes when the queries take only a few 10s of ms to run, the variablity related to the runners is a significant portion of overall query time. I think this will still be a problem with a dedicated runner.

This would of course increase the time required to run benchmarks.

gruuya · 2024-03-27T12:55:09Z

would you be willing to set it?

Yup, I can do that.

Another potential thing to do is to increase the data size / time required per query.

Yeah, I think something like adding TPC-H SF 10 would help somewhat.

epompeii · 2024-03-29T18:51:52Z

track benchmarks over time, through a similar job triggered by merge commits to main (fwiw, I now prefer Bencher to conbench, as it seems simpler to setup/maintain)

@gruuya let me know if you run into anything or have any questions getting setup with Bencher.
I would be more than happy to help answer any questions or with parts of the integration work here.

(optional) re-base benchmarks on criterion.rs

If you do move to Criterion, Bencher has a built-in adapter for Criterion which should make things pretty simple.

The more long term solution, and the first next step that makes sense to me would be to run these on a dedicated runner.

@alamb if you all want to build things out yourselves, the Rustls bench runner may be a good starting point. I recently wrote a case study of the Rustls continuous benchmarking setup if that is of interest.

Another possibility is that I am working on Bencher Cloud Bare Metal. It is a fleet of identical hardware and benchmarks are run on the bare metal servers. I can go into more detail if that is something you all want to explore.

alamb added enhancement New feature or request performance Make DataFusion faster labels Mar 7, 2023

alamb mentioned this issue Mar 7, 2023

[Epic] Improved DataFusion Benchmarking #5505

Open

12 tasks

alamb added the help wanted Extra attention is needed label Mar 7, 2023

This was referenced Mar 7, 2023

long-term stored benchmark performance #3539

Closed

Benchmark runs over time using conbench / ursabench #3088

Closed

alamb mentioned this issue Apr 24, 2023

[EPIC] A list of performance improvement tickets #5546

Open

29 tasks

alamb mentioned this issue Apr 24, 2023

Export benchmark information as line protocol #6107

Open

alamb self-assigned this Apr 24, 2023

alamb mentioned this issue Sep 12, 2023

Performance Regression: Backtraces in errors slow down planning time (Expensive backtraces) #7522

Closed

alamb changed the title ~~Run DataFusion benchmarks regularly and track history~~ Run DataFusion benchmarks regularly and track performance history over time Sep 25, 2023

gruuya mentioned this issue Mar 7, 2024

Add /benchmark github command to comparison benchmark between base and pr commit #9461

Merged

gruuya mentioned this issue Mar 27, 2024

Run TPC-H SF10 during PR benchmarks #9822

Merged

This was referenced Mar 29, 2024

Remove vestigal conbench integration #9855

Merged

Minor: Remove the bench most likely to cause OOM in CI #9858

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Run DataFusion benchmarks regularly and track performance history over time #5504

Run DataFusion benchmarks regularly and track performance history over time #5504

alamb commented Mar 7, 2023 •

edited

Loading

andygrove commented Mar 7, 2023

ozankabak commented Mar 9, 2023

epompeii commented Apr 17, 2023

alamb commented Apr 21, 2023

alamb commented Apr 24, 2023

alamb commented Apr 24, 2023

epompeii commented Apr 25, 2023

alamb commented Sep 15, 2023

alamb commented Sep 25, 2023

Smurphy000 commented Oct 14, 2023

alamb commented Oct 15, 2023

epompeii commented Oct 15, 2023

alamb commented Oct 16, 2023

Smurphy000 commented Oct 16, 2023

epompeii commented Oct 16, 2023

alamb commented Feb 28, 2024

gruuya commented Mar 1, 2024

epompeii commented Mar 2, 2024

alamb commented Mar 3, 2024

gruuya commented Mar 7, 2024

alamb commented Mar 8, 2024

korowa commented Mar 26, 2024 •

edited

Loading

gruuya commented Mar 26, 2024

alamb commented Mar 27, 2024

gruuya commented Mar 27, 2024

epompeii commented Mar 29, 2024

Run DataFusion benchmarks regularly and track performance history over time #5504

Run DataFusion benchmarks regularly and track performance history over time #5504

Comments

alamb commented Mar 7, 2023 • edited Loading

andygrove commented Mar 7, 2023

ozankabak commented Mar 9, 2023

epompeii commented Apr 17, 2023

alamb commented Apr 21, 2023

alamb commented Apr 24, 2023

alamb commented Apr 24, 2023

epompeii commented Apr 25, 2023

alamb commented Sep 15, 2023

alamb commented Sep 25, 2023

Smurphy000 commented Oct 14, 2023

alamb commented Oct 15, 2023

Step 1: transform benchmark data for graphing

Step 2: script to gather data for each commit

epompeii commented Oct 15, 2023

alamb commented Oct 16, 2023

Smurphy000 commented Oct 16, 2023

epompeii commented Oct 16, 2023

alamb commented Feb 28, 2024

gruuya commented Mar 1, 2024

epompeii commented Mar 2, 2024

alamb commented Mar 3, 2024

gruuya commented Mar 7, 2024

alamb commented Mar 8, 2024

korowa commented Mar 26, 2024 • edited Loading

gruuya commented Mar 26, 2024

alamb commented Mar 27, 2024

gruuya commented Mar 27, 2024

epompeii commented Mar 29, 2024

alamb commented Mar 7, 2023 •

edited

Loading

korowa commented Mar 26, 2024 •

edited

Loading