[TPC-H] Query 21 times out at scale 100 #1362

Open
hendrikmakait opened this issue Feb 6, 2024 · 2 comments
Labels
bug Something isn't working tpch

Comments

@hendrikmakait
Member

From a preliminary look at the optimized graph, one issue might be that we don't properly push projections into the parquet reads:

Snippet from the graph:

Projection: columns=['l_orderkey', 'l_suppkey']
    FusedIO:
        ReadParquet: path='./tpch-data/scale-10/lineitem' columns=['l_orderkey', 'l_suppkey', 'l_commitdate', 'l_receiptdate'] filesystem=None kwargs={'dtype_backend': None}

I'd expect ReadParquet to only read ['l_orderkey', 'l_suppkey']. Combined with dask/dask-expr#854, this appears to be fairly catastrophic.
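
To make the expectation concrete, here is a minimal sketch of projection pushdown on a toy expression tree. The `ReadParquet` and `Projection` classes and the `push_projection` helper below are hypothetical stand-ins written for illustration; they are not dask-expr's actual classes or optimizer, and they only handle the single pattern shown in the snippet above.

```python
from dataclasses import dataclass
from typing import Tuple


# Hypothetical toy nodes, NOT dask-expr's real expression classes.
@dataclass(frozen=True)
class ReadParquet:
    path: str
    columns: Tuple[str, ...]


@dataclass(frozen=True)
class Projection:
    child: object
    columns: Tuple[str, ...]


def push_projection(expr):
    """Collapse Projection(ReadParquet) into a ReadParquet with fewer columns."""
    if isinstance(expr, Projection) and isinstance(expr.child, ReadParquet):
        read = expr.child
        missing = [c for c in expr.columns if c not in read.columns]
        if missing:
            raise KeyError(f"columns not available in read: {missing}")
        # The read itself now selects only what the projection asked for.
        return ReadParquet(read.path, expr.columns)
    # Any other shape is left untouched in this sketch.
    return expr


plan = Projection(
    ReadParquet(
        "./tpch-data/scale-10/lineitem",
        ("l_orderkey", "l_suppkey", "l_commitdate", "l_receiptdate"),
    ),
    ("l_orderkey", "l_suppkey"),
)
optimized = push_projection(plan)
print(optimized)
```

In this toy model the optimized plan is a single read restricted to the two projected columns, which is the shape I would have expected in the graph.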

@hendrikmakait hendrikmakait added bug Something isn't working tpch labels Feb 6, 2024
@phofl
Contributor

phofl commented Feb 6, 2024

I don't see this as a bug at the moment: we aim to use only one read_parquet call per data source to avoid reading the same columns more than once. This is a special case, since here we actually could separate the reads, but I think it is a bit of an edge case. I wouldn't spend too much time on this at the moment, although I agree with you that we could be smarter about it. The bigger problem is that separating the reads loses value if there are operations between read_parquet and the column restriction (like replace, shuffle, ...), since we would then do those operations twice. We can certainly make this special case better, but I am not sure it would help us much in the grand scheme of things.
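
A rough sketch of the trade-off, using plain Python rather than any dask-expr API. The two consumers below are an assumption made for illustration (one branch needing only the keys, another needing all four columns from the snippet in the issue), and the helper names are hypothetical.

```python
def shared_read_columns(consumers):
    """One read per source: select the union of columns over all consumers."""
    cols = []
    for wanted in consumers:
        for col in wanted:
            if col not in cols:
                cols.append(col)
    return cols


def separate_read_column_count(consumers):
    """One read per consumer: overlapping columns are read once per consumer."""
    return sum(len(wanted) for wanted in consumers)


# Assumed consumers of the lineitem table (illustrative only).
consumers = [
    ("l_orderkey", "l_suppkey"),
    ("l_orderkey", "l_suppkey", "l_commitdate", "l_receiptdate"),
]

shared = shared_read_columns(consumers)
print("shared read loads:", len(shared), "columns")
print("separate reads load:", separate_read_column_count(consumers), "columns")
```

With a shared read each column is loaded once, but the narrow branch is fed a frame wider than it needs until the projection runs. With separate reads the narrow branch starts small, at the cost of reading the key columns twice and repeating any operations that sit between the read and the projection.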

@milesgranger
Contributor

Maybe related to the new CI failure #1363?
