
Update Default Parquet Write Compression #7692

Merged
merged 3 commits into apache:main on Sep 30, 2023

Conversation

devinjdangelo
Contributor

Which issue does this PR close?

Closes #7691

Rationale for this change

See issue for discussion

What changes are included in this PR?

Set default parquet writer to use zstd level 3 compression.

Are these changes tested?

By existing tests

Are there any user-facing changes?

Much smaller parquet files for a minor write performance penalty.
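
As a purely illustrative sketch (the compression keyword argument to write_parquet is an assumption about the Python bindings, and the paths are hypothetical; none of this code is part of the PR), a user who prefers the previous behavior could still request uncompressed output explicitly:

from datafusion import SessionContext

ctx = SessionContext()
# hypothetical input path
ctx.register_parquet('test', "/path/to/input.parquet")

df = ctx.sql("select * from test")
# Opt back out of the new zstd(3) default for this particular write;
# the compression argument is assumed, not verified against the bindings.
df.write_parquet("/tmp/out_uncompressed.parquet", compression="uncompressed")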

github-actions bot added the sqllogictest (SQL Logic Tests (.slt)) label on Sep 29, 2023
@tustvold
Contributor

tustvold commented Sep 29, 2023

I'm not sure about this; block compression is far from free, and many workloads, particularly those using object storage, would be willing to accept slightly larger objects in exchange for faster read performance.

@devinjdangelo
Contributor Author

devinjdangelo commented Sep 29, 2023

@tustvold Going from uncompressed to even zstd(1) is likely to bring a 50% reduction in file size (up to 80% is not uncommon depending on the data). I have some of my own DataFusion specific benchmarks, but Uber Engineering has a nice public analysis on this topic here.

I added some benchmarks showing with/without compression here. Zstd(1)-Zstd(3) write speeds in the single-threaded AsyncArrowWriter are 15-30% slower for a 50-60% reduction in file size. The parallel writer is only about 5% slower for the same reduction in file size. I have not compared read speeds on the uncompressed vs. compressed files, though if you are I/O-limited to a remote ObjectStore, it is possible for compression to improve read performance.

Another argument in favor of this change is that most other popular frameworks with parquet write support default to compression of either snappy (compatibility) or zstd (best performance), so users of DataFusion imo will not expect the default to be uncompressed.

The DataFusion default is not all that important for systems/database developers, since those users will almost certainly tune the settings to their use case anyway. It matters more for efforts to gain adoption among direct users of DataFusion for analysis/ETL workloads, such as via datafusion-cli or the Python bindings.
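
As a rough illustration of that point, here is a sketch using pyarrow purely as an example of another framework (its snappy default is my understanding of that library, stated as an assumption rather than verified here):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": list(range(1_000_000))})

# pyarrow compresses with snappy unless told otherwise (assumed default)
pq.write_table(table, "/tmp/pyarrow_default.parquet")

# explicit zstd output for a size comparison
pq.write_table(table, "/tmp/pyarrow_zstd.parquet", compression="zstd")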

@tustvold
Contributor

What's the performance hit like for read? A write speed reduction is not a huge deal to me; it is read performance that matters, as this will have a direct impact on query latency.

FWIW LZ4_RAW is probably the "best" codec, but ecosystem support is limited...

@devinjdangelo
Contributor Author

devinjdangelo commented Sep 29, 2023

I ran a quick test using Query1, reading the uncompressed parquet file vs. zstd compressed. This is using local SSD based storage.

  • Uncompressed: 1.8393s
  • Zstd: 1.8297s

The performance is nearly identical. I averaged each over 50 runs for the above numbers, and they are converging towards a <1% performance difference. Run-to-run variance is ~5%, so on a single run either one can be faster.

Script:

import time
from datafusion import SessionContext

t = time.time()

#uncompressed file, ~3.6Gb on disk
#file = "/home/dev/arrow-datafusion/benchmarks/data/tpch_sf10/lineitem/part-0.parquet"

#zstd compressed file, ~1.6Gb on disk
file = "/home/dev/arrow-datafusion/test_out/benchon.parquet"

# Create a DataFusion context
ctx = SessionContext()

# Register table with context
ctx.register_parquet('test', file)

times = []
for i in range(50):
    t = time.time()
    query = """
    select
        l_returnflag,
        l_linestatus,
        sum(l_quantity) as sum_qty,
        sum(l_extendedprice) as sum_base_price,
        sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
        sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
        avg(l_quantity) as avg_qty,
        avg(l_extendedprice) as avg_price,
        avg(l_discount) as avg_disc,
        count(*) as count_order
    from
        test
    where
            l_shipdate <= date '1998-09-02'
    group by
        l_returnflag,
        l_linestatus
    order by
        l_returnflag,
        l_linestatus
    """

    # Execute SQL
    df = ctx.sql(query)
    df.show()
    elapsed = time.time() - t
    times.append(elapsed)
    print(f"datafusion agg query {elapsed}s")

print(sum(times)/len(times))

@tustvold
Contributor

tustvold commented Sep 29, 2023

The performance is nearly identical

Can you run a test that is actually bottlenecked on parquet, e.g. a predicated scan, as opposed to something with sorts and group bys in it that will dominate pretty much anything else other than joins?

@devinjdangelo
Contributor Author

I was also a bit suspicious of how identical the performance was. I caught a mistake in my setup: both numbers above were for ZSTD, not uncompressed. I corrected the mistake and ran two more tests below. Indeed, on local storage, uncompressed reads are a good bit faster. I would be interested to compare this to remote object storage, where bandwidth may be more of a bottleneck.

Removing the group by / order by:

  • Uncompressed: 0.5305s
  • Zstd: 0.7015s

New Query:

    select
        sum(l_quantity) as sum_qty,
        sum(l_extendedprice) as sum_base_price,
        sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
        sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
        avg(l_quantity) as avg_qty,
        avg(l_extendedprice) as avg_price,
        avg(l_discount) as avg_disc,
        count(*) as count_order
    from
        test
    where
        l_shipdate <= date '1998-09-02'

I also timed caching the entire parquet file into memory (select * from test, then df.cache()); a rough sketch of that timing loop follows the numbers below. I only averaged over 5 runs this time.

  • Uncompressed: 6.818s
  • Zstd: 10.052s
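
A minimal sketch of that timing loop (DataFrame.cache() materializing the full result in memory is my reading of the Python bindings, so treat the exact API as an assumption):

import time
from datafusion import SessionContext

# zstd file from the script above; swap in the uncompressed path for the other run
file = "/home/dev/arrow-datafusion/test_out/benchon.parquet"

ctx = SessionContext()
ctx.register_parquet('test', file)

times = []
for _ in range(5):
    t = time.time()
    # pull the entire table into memory
    cached = ctx.sql("select * from test").cache()
    times.append(time.time() - t)

print(sum(times) / len(times))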

@tustvold
Contributor

remote object storage where bandwidth

It is typically first-byte latency, not bandwidth, that hurts with object stores, so the size of the pages can end up being less relevant than you might expect...

I dunno, I suspect there is no "correct" answer to this, which leads me to be tempted to just leave it as is, but I don't feel strongly, so if other people do...

@devinjdangelo
Contributor Author

I agree that there is no universally optimal choice and am also interested in more opinions.

For the bandwidth concern, I'm thinking more about a user who installs the Python bindings on their MacBook Pro and reads from S3. The bottleneck there may be their likely 1 Gbps network connection. Even in a server environment, if you are running on a single node, bandwidth could be limiting.
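
As a back-of-the-envelope illustration (assuming the full file must be transferred and ignoring request latency): at 1 Gbps, roughly 125 MB/s, the ~3.6 GB uncompressed lineitem file takes about 29 s to download, while the ~1.6 GB zstd file takes about 13 s, so the extra decompression cost measured above could plausibly be hidden behind the transfer savings.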

@alamb
Contributor

alamb left a comment

I think defaulting to basic and fast block level compression is much less surprising than NO compression for most users and therefore I think this is a good change.

While there may be corner cases where no compression is desired as @tustvold mentions, I think they are somewhat specialized

Maybe some other contributors have opinions on the matter (@Dandandan perhaps?)

@alamb
Contributor

alamb commented Sep 29, 2023

I took the liberty of merging up from main to get #7701 and solve the failing CI check

@Dandandan
Contributor

I think defaulting to basic and fast block level compression is much less surprising than NO compression for most users and therefore I think this is a good change.

While there may be corner cases where no compression is desired as @tustvold mentions, I think they are somewhat specialized

Maybe some other contributors have opinions on the matter (@Dandandan perhaps?)

I agree. Parquet is designed for block compression, and people want to optimize for long-term storage cost as well. Zstd is probably one of the better defaults for the use cases @devinjdangelo describes (data pipelines).

@Dandandan merged commit 692ea24 into apache:main on Sep 30, 2023
23 checks passed
@Dandandan
Contributor

Thank you @devinjdangelo

Ted-Jiang pushed a commit to Ted-Jiang/arrow-datafusion that referenced this pull request Oct 7, 2023
* update compression default

* fix tests

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@andygrove added the enhancement (New feature or request) label on Oct 7, 2023
Linked issue: Update Default Parquet Write Compression to Sensible Default (#7691)
5 participants