
parquet: Add finer metrics on operations covered by time_elapsed_opening #12585

Merged: 5 commits, Sep 25, 2024

Conversation

Contributor

@progval progval commented Sep 23, 2024

Which issue does this PR close?

Closes #12584.

Rationale for this change

This allows both Datafusion developers and users to better measure the impact of their optimizations. For example, I am working on #12547 (caching the page index), and my patch makes these metrics go from:

statistics_eval_time=775.283531ms, row_pushdown_eval_time=191ns, time_elapsed_scanning_until_data=16.463873ms, time_elapsed_processing=401.16043928s, time_elapsed_opening=473.912060225s, page_index_eval_time=24.19564ms, metadata_load_time=410.282903954s, bloom_filter_eval_time=62.038167937s, time_elapsed_scanning_total=16.516948ms

to

row_pushdown_eval_time=191ns, time_elapsed_scanning_until_data=1.143893882s, time_elapsed_processing=34.501308377s, time_elapsed_opening=67.214927741s, page_index_eval_time=32.002383ms, metadata_load_time=3.411246ms, bloom_filter_eval_time=66.495163165s, statistics_eval_time=1.255058963s, time_elapsed_scanning_total=1.143978104s

which shows it reduced metadata_load_time and time_elapsed_opening with little impact to the others.

What changes are included in this PR?

  1. Add metadata_load_time
  2. Add statistics_eval_time and bloom_filter_eval_time
  3. Rename pushdown_eval_time to row_pushdown_eval_time, because statistics and bloom filter evaluation are pushdown filtering too

Are these changes tested?

Not really. There are no unit tests, but it works in datafusion-cli and as a library.

Are there any user-facing changes?

Renamed a metric, and added two others, visible in EXPLAIN ANALYZE.

@github-actions github-actions bot added documentation Improvements or additions to documentation core Core DataFusion crate labels Sep 23, 2024
@progval progval changed the title parquet: Add metrics on operations covered by time_elapsed_opening parquet: Add finer metrics on operations covered by time_elapsed_opening Sep 23, 2024
Contributor

@alamb alamb left a comment


Thank you @progval

I have some small suggestions but I don't think they are required to merge this PR

@@ -43,14 +43,20 @@ pub struct ParquetFileMetrics {
    pub pushdown_rows_pruned: Count,
    /// Total rows passed predicates pushed into parquet scan
    pub pushdown_rows_matched: Count,
    /// Total time spent evaluating pushdown filters
    pub pushdown_eval_time: Time,
    /// Total time spent evaluating row-level pushdown filters
Contributor


As an aside while I am reviewing this code, the description of what the metrics mean would be more useful if we could attach them to the Metric itself somehow 🤔

Something like:

        let row_pushdown_eval_time = MetricBuilder::new(metrics)
            .with_description("Total time spent evaluating row-level pushdown filters")
            .with_new_label("filename", filename.to_string()) 
            .subset_time("row_pushdown_eval_time", partition);

Though then we would have to figure out how to expose that information in an explain plan 🤔


if enable_bloom_filter && !row_groups.is_empty() {
let mut timer = file_metrics.bloom_filter_eval_time.timer();
Contributor


I note that the time calculation for page_index_eval_time below seems to happen inside the call to

                    access_plan = p.prune_plan_with_page_index(
                        access_plan,
                        &file_schema,
                        builder.parquet_schema(),
                        file_metadata.as_ref(),
                        &file_metrics,
                    );

I wonder if we should move the timer into this function for consistency (or alternatively push the timer calculation into prune_by_bloom_filters and prune_by_statistics)

Contributor Author

@progval progval Sep 24, 2024


Good point. I did the latter, because it's probably closer to what users expect.

row_groups.prune_by_statistics(
    &file_schema,
    builder.parquet_schema(),
    rg_metadata,
    predicate,
    &file_metrics,
);
timer.stop();
Contributor


I don't think it is a problem, but technically speaking the stop() is unnecessary (it happens automatically when the timer goes out of scope and is Dropped):

https://github.com/alamb/datafusion/blob/544e49bb0acac7130a873a92b44e1c902e41ac8f/datafusion/physical-plan/src/metrics/value.rs#L335-L339
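The stop-on-Drop pattern can be sketched with a toy version of the metric types (this is an illustration of the idiom, not DataFusion's actual implementation; the `Time`/`Timer` names mirror the real metrics but the bodies here are simplified):

```rust
use std::cell::Cell;
use std::rc::Rc;
use std::time::Instant;

// Toy stand-in for a `Time` metric: accumulates elapsed nanoseconds.
#[derive(Clone, Default)]
struct Time(Rc<Cell<u128>>);

impl Time {
    fn timer(&self) -> Timer {
        Timer { time: self.clone(), start: Some(Instant::now()) }
    }
    fn value(&self) -> u128 {
        self.0.get()
    }
}

// A running timer; records into its `Time` when stopped *or* dropped.
struct Timer {
    time: Time,
    start: Option<Instant>,
}

impl Timer {
    fn stop(&mut self) {
        // `Option::take` makes a second stop() (or stop + Drop) a no-op.
        if let Some(start) = self.start.take() {
            self.time.0.set(self.time.0.get() + start.elapsed().as_nanos());
        }
    }
}

impl Drop for Timer {
    fn drop(&mut self) {
        self.stop();
    }
}

fn main() {
    let metric = Time::default();
    {
        let _timer = metric.timer();
        std::thread::sleep(std::time::Duration::from_millis(10));
        // no explicit stop(): Drop records the elapsed time at end of scope
    }
    assert!(metric.value() >= 10_000_000); // at least 10 ms, in nanoseconds
}
```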

Contributor Author


This one is needed, because the timer only goes out of scope after the bloom filter evaluation. The next timer.stop() is indeed redundant, but I kept it for consistency.
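A minimal sketch of the scoping issue, with `sleep`s standing in for the two pruning phases (plain `Instant`s here rather than the metric timers; the durations are illustrative, not measured):

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

fn main() {
    // Both pruning phases share one lexical scope, as in the PR.
    let start = Instant::now();
    sleep(Duration::from_millis(5)); // stand-in for prune_by_statistics
    let stats_time = start.elapsed(); // explicit stop: record *before* the next phase
    sleep(Duration::from_millis(50)); // stand-in for bloom filter evaluation
    let total = start.elapsed();

    // Relying on end-of-scope instead would have folded the bloom filter
    // time into the statistics metric.
    assert!(stats_time < total);
}
```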

Contributor Author


Doesn't matter anymore, as I moved both calls to the callees.

@alamb alamb merged commit b821929 into apache:main Sep 25, 2024
25 checks passed

alamb commented Sep 25, 2024

Thanks again @progval

bgjackma pushed a commit to bgjackma/datafusion that referenced this pull request Sep 25, 2024
…ning` (apache#12585)

* Rename 'pushdown_eval_time' to 'row_pushdown_eval_time'

As it does not measure evaluation of row group-level pushdown filters

* Add statistics_eval_time and bloom_filter_eval_time metrics

* Add metadata_load_time metric

* Add docs

* Move timers to the callees
Successfully merging this pull request may close these issues.

parquet: Refine time_elapsed_opening metric