
TPCH, Query 18 and 17 very slow #5646

Closed
djouallah opened this issue Mar 20, 2023 · 45 comments

Labels
bug Something isn't working

Comments

@djouallah

djouallah commented Mar 20, 2023

I was running TPCH_SF5 just for fun and noticed that queries 17 and 18 are very slow.

Full reproducible example:

Another issue: when I increase the scale factor to 10, I start getting OOM errors.

https://colab.research.google.com/drive/1WJ2ICxJyAYClkDx8guGX-TcOMnf8SBxr#scrollTo=z494Cl6XKUVX

DataFusion 26 gives the following results against DuckDB:

[screenshot: benchmark results vs DuckDB]

@djouallah added the bug (Something isn't working) label on Mar 20, 2023
@mingmwang
Contributor

I will take a look.

@mingmwang
Contributor

Working on it now. I am not sure whether it is a regression or whether those two queries have always been slow.

@mingmwang
Contributor

mingmwang commented Mar 29, 2023

The bottleneck of q17 appears to be the aggregation.

@Dandandan
Contributor

Dandandan commented Apr 3, 2023

A profile run using flamegraph on my machine shows:

arrow_cast::cast::cast_decimal_to_decimal is consuming about 1/4 of the time of q17.

@Dandandan
Contributor

[flamegraph attachment]

@Dandandan
Contributor

FYI @viirya

@mingmwang
Contributor

One reason for so many downcast_value calls is that the grouping column l_partkey has high cardinality, which makes the vectorization almost useless.
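
To illustrate the effect (a hand-written sketch, not DataFusion's code): when nearly every l_partkey is its own group, each accumulator update touches a single row, so the work degenerates into per-row hash lookups, downcasts, and scalar adds instead of one vectorized pass over a long column.

```rust
use std::collections::HashMap;

// Hand-written illustration, not DataFusion code: grouped sum where almost
// every key is distinct. Each row costs a hash lookup plus one scalar add,
// so there is no long run of values for a vectorized kernel to work on.
fn grouped_sum(part_keys: &[i64], extended_prices: &[f64]) -> HashMap<i64, f64> {
    let mut sums: HashMap<i64, f64> = HashMap::new();
    for (key, price) in part_keys.iter().zip(extended_prices) {
        *sums.entry(*key).or_insert(0.0) += *price;
    }
    sums
}

// By contrast, an ungrouped (or low-cardinality) sum is one tight loop over
// the whole column, which the compiler can auto-vectorize.
fn ungrouped_sum(extended_prices: &[f64]) -> f64 {
    extended_prices.iter().sum()
}
```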

@Dandandan
Contributor

Dandandan commented Apr 4, 2023

The most expensive part is the line let values = &cast(values, sum_type)? in sum_batch, which performs this casting.

I guess we should move that evaluation (and possibly other parts of sum_batch) up, so it happens before the grouping rather than after; that way it remains vectorized.
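
Roughly what that would look like (a hypothetical sketch using arrow-rs kernels, not the actual sum_batch code): cast the whole input column once per batch before the rows are routed to groups, instead of casting inside every accumulator update.

```rust
use arrow::array::{Array, ArrayRef, Float64Array};
use arrow::compute::{cast, sum};
use arrow::datatypes::DataType;
use arrow::error::ArrowError;

// Hypothetical sketch, not the actual sum_batch code.
// Today's pattern: each group's accumulator casts its own (often tiny) slice,
// mirroring `let values = &cast(values, sum_type)?` -- the cast kernel runs
// once per group and cannot amortize its overhead.
fn sum_per_group_with_cast(
    group_slices: &[ArrayRef],
    sum_type: &DataType,
) -> Result<f64, ArrowError> {
    let mut total = 0.0;
    for slice in group_slices {
        let casted = cast(slice, sum_type)?;
        let casted = casted
            .as_any()
            .downcast_ref::<Float64Array>()
            .expect("this sketch assumes the sum type is Float64");
        total += sum(casted).unwrap_or(0.0);
    }
    Ok(total)
}

// Proposed direction: cast the full column once, before grouping, so the
// per-group accumulation needs no further casting and stays vectorized.
fn cast_once_before_grouping(
    input_column: &ArrayRef,
    sum_type: &DataType,
) -> Result<ArrayRef, ArrowError> {
    cast(input_column, sum_type)
}
```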

@Dandandan
Contributor

Dandandan commented Apr 4, 2023

Another observation I have is that the plan does some unnecessary casting:

Projection: CAST(SUM(lineitem.l_extendedprice) AS Float64) / Float64(7) AS avg_yearly
  Aggregate: groupBy=[[]], aggr=[[SUM(lineitem.l_extendedprice)]]
    Projection: lineitem.l_extendedprice
      Inner Join: part.p_partkey = __scalar_sq_3.l_partkey Filter: CAST(lineitem.l_quantity AS Decimal128(30, 15)) < CAST(__scalar_sq_3.__value AS Decimal128(30, 15))
        Projection: lineitem.l_quantity, lineitem.l_extendedprice, part.p_partkey
          Inner Join: lineitem.l_partkey = part.p_partkey
            TableScan: lineitem projection=[l_partkey, l_quantity, l_extendedprice]
            Projection: part.p_partkey
              Filter: part.p_brand = Utf8("Brand#23") AND part.p_container = Utf8("MED BOX")
                TableScan: part projection=[p_partkey, p_brand, p_container], partial_filters=[part.p_brand = Utf8("Brand#23"), part.p_container = Utf8("MED BOX")]
        SubqueryAlias: __scalar_sq_3
          Projection: lineitem.l_partkey, Float64(0.2) * CAST(AVG(lineitem.l_quantity) AS Float64) AS __value
            Aggregate: groupBy=[[lineitem.l_partkey]], aggr=[[AVG(lineitem.l_quantity)]]
              TableScan: lineitem projection=[l_partkey, l_quantity]
The unnecessary casts are:

1. the cast inside the aggregation (in sum_batch)
2. the cast to float for __scalar_sq_3.__value, I guess because 0.2 is assumed to be a float
3. the cast back to decimal for the join filter
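
For reference, this plan comes from TPC-H query 17, which (with the substitution parameters visible in the plan) reads roughly:

```sql
select
    sum(l_extendedprice) / 7.0 as avg_yearly
from
    lineitem,
    part
where
    p_partkey = l_partkey
    and p_brand = 'Brand#23'
    and p_container = 'MED BOX'
    and l_quantity < (
        select
            0.2 * avg(l_quantity)
        from
            lineitem
        where
            l_partkey = p_partkey
    );
```

The 0.2 literal in the correlated subquery is what gets planned as Float64(0.2), which in turn drags avg(l_quantity) and the join-filter comparison into the float/decimal casts listed above.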

@mingmwang
Contributor

@Dandandan
#5866

@jackwener
Member

jackwener commented Apr 5, 2023

Another observation I have is that the plan does some unnecessary casting:

@Dandandan, yes, I also noticed this problem; it is related to type coercion.

It isn't an easy problem: type coercion in PG is a top-down process, so the parent can request a type from its children. It's similar to interesting order/enforcement. Current DataFusion does type coercion bottom-up, like MySQL, which isn't good enough. I plan to improve type coercion in the future, following PG and Spark.

BTW, #5831 is also related to this q17; it moves the cast from expression evaluation into the subplan.

@Dandandan
Contributor

@jackwener Nice, thank you

@Dandandan
Contributor

@jackwener this is all resolved now, right?

@djouallah
Author

DataFusion is making great progress, but this is still not solved.
Version: 26.0.0

[screenshot: benchmark results]

@mingmwang
Contributor

I am still working on it.

@Dandandan
Contributor

Dandandan commented Jun 25, 2023

Query 18 is also considerably faster in the next DataFusion version, because of join-related improvements (a data-structure improvement and vectorized collision checks).
I'll be looking into what we can do further.

@Dandandan
Contributor

One suggestion that will yield some (smaller) performance improvement for query 18 (and most other queries): #6768

@alamb
Contributor

alamb commented Jun 30, 2023

There is quite a lot of recent work / proposals to make the grouping significantly faster for these queries. See #4973

@djouallah
Author

Using Python datafusion 27, there are unfortunately still issues with Query 17: my VM has 8 cores and 64 GB of RAM, and Query 17 got an OOM.

[screenshot: benchmark results]

@alamb
Contributor

alamb commented Jul 9, 2023

I expect Q17 to go about 2x faster and use much less memory when we merge our most recent work -- see #6800 (comment) for details

@djouallah
Author

djouallah commented Jul 9, 2023

I am doing this experimentation using a Fabric notebook; DataFusion is doing alright. I would really love to start seeing numbers for 8 cores, as the 8-core machine currently has 64 GB yet DF gets an OOM :(
[screenshot: benchmark results]

@Dandandan
Contributor

Dandandan commented Jul 10, 2023

Thanks @djouallah - the new GroupHashAggregate approach will also (drastically) reduce memory usage.

@alamb
Contributor

alamb commented Jul 10, 2023

BTW I think the reason DF's memory usage is increasing with number of cores is because the first partial aggregate phase is using RoundRobin repartitioning (and thus each hash table has an entry for all the groups).

To avoid this, we would need to hash-repartition the input based on the group keys so that the different partitions see different subsets of the group keys.
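
Roughly, the two plan shapes look like this (a hand-written sketch in the spirit of DataFusion physical plans, not actual EXPLAIN output):

```
-- today: RoundRobin repartitioning before the partial aggregate, so every
-- partition's partial hash table can contain (almost) every group
AggregateExec: mode=FinalPartitioned, gby=[l_partkey]
  RepartitionExec: partitioning=Hash([l_partkey], N)
    AggregateExec: mode=Partial, gby=[l_partkey]
      RepartitionExec: partitioning=RoundRobinBatch(N)
        ParquetExec: file_groups=...

-- alternative: hash-repartition on the group key before the partial aggregate,
-- so each partition only ever sees its own subset of group keys
AggregateExec: mode=FinalPartitioned, gby=[l_partkey]
  AggregateExec: mode=Partial, gby=[l_partkey]
    RepartitionExec: partitioning=Hash([l_partkey], N)
      ParquetExec: file_groups=...
```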

@djouallah
Author

@alamb the graph shows the overall duration to finish tpch_sf100 based on the number of cores; DataFusion is faster than Spark even when using only 1 VM ;)

@alamb
Contributor

alamb commented Jul 17, 2023

If you get a chance to test with the latest DataFusion (it will be in 28.0.0, ETA probably next week), I expect performance for high-cardinality grouping to be much better due to #6904.

@alamb
Contributor

alamb commented Jul 17, 2023

BTW I think the reason DF's memory usage is increasing with number of cores is because the first partial aggregate phase is using RoundRobin repartitioning (and thus each hash table has an entry for all the groups).

I wrote up an issue describing this here: #6937

@djouallah
Author

djouallah commented Aug 3, 2023

Using version 28, query 8 starts getting errors:

https://colab.research.google.com/drive/1KzofqAWJxVTboNcywGxSbIgLNatkIsM2

Query8
Arrow error: Compute error: Overflow happened on: 136581719431 * 100000000000000000000000000000000000000

@Dandandan
Contributor

Sounds like #6794
Cc @viirya

@djouallah
Author

djouallah commented Aug 3, 2023

Good job btw on Query 17 and 18; unfortunately, when RAM is limited, DataFusion still gets an OOM for Query 18.

[screenshot: benchmark results]

@viirya
Member

viirya commented Aug 3, 2023

Sounds like #6794 Cc @viirya

I suppose that the decimal multiplication and division precision change in #6832 will fix that.
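
For what it's worth, the overflow itself is easy to reproduce outside DataFusion: Decimal128 values are backed by i128, whose maximum is about 1.7 × 10^38, so multiplying by a rescale factor like the one in the error message (assumed here to be 10^38) cannot fit. A small standalone check:

```rust
fn main() {
    // The left operand from the error message.
    let value: i128 = 136_581_719_431;
    // Assumed to be the 10^38 rescale factor used when widening a Decimal128
    // toward its maximum precision/scale (the largest power of ten that still
    // fits in an i128).
    let rescale: i128 = 10_i128.pow(38);

    println!("i128::MAX     = {}", i128::MAX);
    println!("value * 10^38 = {:?}", value.checked_mul(rescale)); // None -> overflow
}
```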

@Dandandan
Contributor

Good job btw on Query 17 and 18; unfortunately, when RAM is limited, DataFusion still gets an OOM for Query 18.

Nice, getting close to DuckDB. Hyper is incredible for some queries!

@djouallah
Author

@Dandandan especially when using their native format, Hyper literally just doesn't care about RAM. I use it with the free Colab, and it finished tpch_sf110 just fine! What do you think they are doing differently?

@Dandandan
Contributor

@Dandandan especially when using their native format, Hyper literally just doesn't care about RAM. I use it with the free Colab, and it finished tpch_sf110 just fine! What do you think they are doing differently?

Hard to say in general, but they do some optimizations we don't do, or do better planning (e.g. for join selection).
Using a different format instead of Parquet might help as well, as Parquet can be slower to decode/decompress than formats optimized for query performance.

@alamb
Contributor

alamb commented Aug 3, 2023

just fine! What do you think they are doing differently?

It is also probably good to point out that Hyper is largely a research system, and one of the standard benchmark sets used for research systems is TPCH.

Thus I suspect a lot of effort has gone into making the patterns that appear in TPCH very fast (e.g. left-deep join trees with very selective predicates). That is not to say the optimizations are entirely TPCH-specific, but it wouldn't surprise me if, in general-purpose use, DataFusion's performance is much closer (or better).

@Dandandan
Contributor

Dandandan commented Aug 4, 2023

just fine! What do you think they are doing differently?

It is also probably good to point out that Hyper is largely a research system, and one of the standard benchmark sets used for research systems is TPCH.

Thus I suspect a lot of effort has gone into making the patterns that appear in TPCH very fast (e.g. left-deep join trees with very selective predicates). That is not to say the optimizations are entirely TPCH-specific, but it wouldn't surprise me if, in general-purpose use, DataFusion's performance is much closer (or better).

Hyper-db is developed by Tableau now, so it probably has some improvements over the last few years compared to the "research system".
Some release notes for the last couple of years: https://tableau.github.io/hyper-db/docs/releases

@djouallah
Author

djouallah commented Sep 14, 2023

It seems there is a regression with query 18: it used to work fine with tpch100 using 124 GB of RAM, but now the notebook crashes when using DF31!

Edit: never mind, it was a temporary glitch.

@alamb
Contributor

alamb commented Sep 19, 2023

Given the long history of this issue, I think it is hard to understand what, if anything, it is tracking. I suggest we close it and file another issue to continue discussing additional performance improvements

@djouallah
Author

The issue is very clear: the performance of Query 17 and 18 is still very slow compared to other in-process engines!
[screenshot: benchmark comparison]

@djouallah
Author

Btw, DF31 is making great progress. My impression is that as long as the data fits in memory, the performance is very similar to DuckDB. Here I am reading Parquet files from Azure storage; the main issue starts with 8 cores (which has 64 GB of RAM), where Query 18 crashes the notebook.
[screenshot: benchmark results]

@alamb
Contributor

alamb commented Oct 17, 2023

BTW I think the core problem here is that DataFusion's parallel hash grouping builds the entire hash table for each input partition -- thus it requires memory proportional to the number of cores. This is tracked in more detail in #6937.

@alamb
Contributor

alamb commented Oct 27, 2023

I spent some time analyzing why Q17 and Q18 are slow in detail this morning (in the context of #6782 ). My analysis shows we could close most of the gap with a better join order:

@korowa
Contributor

korowa commented Nov 6, 2023

The issue seems to be (at least partially) related to FilterExec returning unknown statistics in the case of unsupported filter predicates -- I wonder if it would be better / more correct to rely on a worst-case scenario for such filters and simply propagate the input statistics; that seems to be enough to fix the Q17 plan.

@alamb
Contributor

alamb commented Nov 6, 2023

I wonder if it would be better / more correct to rely on a worst-case scenario for such filters and simply propagate the input statistics

Perhaps the filter can simply switch to Precision::Inexact when it can't analyze the selectivity of the expression.

Another thing I have seen in the past is heuristically picking a constant selectivity (e.g. assume the filter keeps half the rows). However, I think this leads to non-robust plans (sometimes the plans are good, sometimes they are bad, and it is hard to predict when each case is hit).
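
A minimal standalone sketch of that idea (deliberately using a toy enum rather than DataFusion's actual statistics types): when the predicate can't be analyzed, propagate the input row count as an inexact upper bound instead of dropping the statistics entirely.

```rust
// Toy types for illustration only -- not DataFusion's statistics API.
#[derive(Debug, Clone, Copy)]
enum Precision {
    Exact(usize),   // known to be correct
    Inexact(usize), // an estimate / upper bound
    Absent,         // nothing known
}

// Estimate a filter's output row count. If the predicate's selectivity can't
// be analyzed, the worst case is that every input row passes, so keep the
// input count but downgrade it to Inexact instead of returning Absent.
fn filter_output_rows(input_rows: Precision, selectivity: Option<f64>) -> Precision {
    match (input_rows, selectivity) {
        (Precision::Exact(n) | Precision::Inexact(n), Some(sel)) => {
            Precision::Inexact((n as f64 * sel).round() as usize)
        }
        (Precision::Exact(n) | Precision::Inexact(n), None) => Precision::Inexact(n),
        (Precision::Absent, _) => Precision::Absent,
    }
}

fn main() {
    // Unsupported predicate: 1M exact input rows become an inexact 1M estimate.
    println!("{:?}", filter_output_rows(Precision::Exact(1_000_000), None));
    // Known selectivity of 0.5: inexact 500k.
    println!("{:?}", filter_output_rows(Precision::Exact(1_000_000), Some(0.5)));
}
```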

@alamb
Contributor

alamb commented Nov 7, 2023

I filed #8078 with a proposal of a more precise way to represent inexact statistics

@djouallah
Author

Thank you very much for your work. I am happy with DataFusion 33 performance; it now finishes TPCH_SF100 using 64 GB of RAM in Fabric.
