PARQUET-1222: [Format] Add details about sort order to README.md #185

Merged: 5 commits, Dec 7, 2022
40 changes: 38 additions & 2 deletions README.md
@@ -81,7 +81,7 @@ more pages.
- Encoding/Compression - Page

## File format
- This file and the [thrift definition](src/main/thrift/parquet.thrift) should be read together to understand the format.
+ This file and the [Thrift definition](src/main/thrift/parquet.thrift) should be read together to understand the format.

4-byte magic number "PAR1"
<Column 1 Chunk 1 + Column Metadata>
@@ -104,7 +104,7 @@ This file and the [thrift definition](src/main/thrift/parquet.thrift) should be
In the above example, there are N columns in this table, split into M row
groups. The file metadata contains the locations of all the column metadata
start locations. More details on what is contained in the metadata can be found
- in the thrift definition.
+ in the Thrift definition.

Metadata is written after the data to allow for single pass writing.

@@ -144,6 +144,42 @@ documented in [LogicalTypes.md][logical-types].

[logical-types]: LogicalTypes.md

### Sort Order

Parquet stores min/max statistics at several levels (such as Column Chunk,
Column Index, and Data Page). Comparisons of values of a given type obey the
following rules:

1. Each logical type has a specified comparison order. If a column is
   annotated with an unknown logical type, statistics may not be used
   for pruning data. The sort order for logical types is documented in
   the [LogicalTypes.md][logical-types] page.
2. For primitive types, the following rules apply:

   * BOOLEAN - false, true
   * INT32, INT64 - Signed comparison.
   * FLOAT, DOUBLE - Signed comparison with special handling of NaNs and
     signed zeros. The details are documented in the
     [Thrift definition](src/main/thrift/parquet.thrift) in the
     `ColumnOrder` union. They are summarized here but the Thrift definition
     is considered authoritative:
     * NaNs should not be written to min or max statistics fields.
     * If the computed max value is zero (whether negative or positive),
       `+0.0` should be written into the max statistics field.
     * If the computed min value is zero (whether negative or positive),
       `-0.0` should be written into the min statistics field.

     For backwards compatibility when reading files:
     * If the min is a NaN, it should be ignored.
     * If the max is a NaN, it should be ignored.
     * If the min is +0, the row group may contain -0 values as well.
     * If the max is -0, the row group may contain +0 values as well.
     * When looking for NaN values, min and max should be ignored.

   * BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY - Lexicographic unsigned byte-wise
     comparison.

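The writer-side rules for FLOAT and DOUBLE statistics can be sketched in a few lines of Python. This is a minimal illustration, not code from any Parquet implementation: the function name `float_min_max_for_stats` is hypothetical, and returning `None` when every value is NaN is an assumption about how a writer might leave the fields unset.

```python
import math
from typing import Iterable, Optional, Tuple


def float_min_max_for_stats(values: Iterable[float]) -> Optional[Tuple[float, float]]:
    """Return the (min, max) pair a writer could store, or None to omit both.

    Hypothetical helper: the name and the None convention are assumptions.
    """
    # NaNs must never reach the min or max statistics fields.
    non_nan = [v for v in values if not math.isnan(v)]
    if not non_nan:
        return None  # assumption: with no comparable values, write no min/max

    lo, hi = min(non_nan), max(non_nan)

    # -0.0 == 0.0 in Python, so each test below matches either signed zero.
    if lo == 0.0:
        lo = -0.0  # a zero min is always written as -0.0
    if hi == 0.0:
        hi = 0.0   # a zero max is always written as +0.0
    return lo, hi
```

Because the two zeros compare equal, the sign of a stored bound has to be inspected with something like `math.copysign(1.0, x)` rather than with `==`.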

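As a rough illustration of the BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY rule, the sketch below compares two byte strings position by position using unsigned byte values. The name `compare_unsigned_bytes` is hypothetical, and treating a value that is a strict prefix of another as the smaller one is the usual lexicographic convention, assumed here rather than taken from the text above.

```python
def compare_unsigned_bytes(a: bytes, b: bytes) -> int:
    """Return -1, 0, or 1 under lexicographic unsigned byte-wise order.

    Hypothetical helper written for illustration only.
    """
    for x, y in zip(a, b):
        # Iterating over bytes yields ints in 0..255, i.e. unsigned values.
        if x != y:
            return -1 if x < y else 1
    # All shared positions are equal: the shorter value sorts first (assumed
    # standard lexicographic tie-break).
    return (len(a) > len(b)) - (len(a) < len(b))
```

The unsigned interpretation matters for bytes at or above `0x80`: read as a signed 8-bit value, `0x80` is -128 and would sort before `0x7f`, whereas the unsigned rule places it after.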
## Nested Encoding
To encode nested columns, Parquet uses the Dremel encoding with definition and
repetition levels. Definition levels specify how many optional fields in the
7 changes: 7 additions & 0 deletions src/main/thrift/parquet.thrift
@@ -902,6 +902,13 @@ union ColumnOrder {
* - If the min is +0, the row group may contain -0 values as well.
* - If the max is -0, the row group may contain +0 values as well.
* - When looking for NaN values, min and max should be ignored.
*
* When writing statistics the following rules should be followed:
* - NaNs should not be written to min or max statistics fields.
* - If the computed max value is zero (whether negative or positive),
* `+0.0` should be written into the max statistics field.
* - If the computed min value is zero (whether negative or positive),
* `-0.0` should be written into the min statistics field.
*/
1: TypeDefinedOrder TYPE_ORDER;
}
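
The reader-side rules in the comment above (ignore NaN bounds, widen zero bounds, ignore min and max when searching for NaN) could be applied roughly as follows. This is a sketch under stated assumptions: `usable_float_bounds` and `may_contain` are hypothetical names, and `None` is used here to mean "bound unknown".

```python
import math
from typing import Optional, Tuple


def usable_float_bounds(
    stat_min: Optional[float], stat_max: Optional[float]
) -> Tuple[Optional[float], Optional[float]]:
    """Return (lower, upper) bounds that are safe for pruning; None = unknown."""
    # A NaN min or max is ignored, i.e. treated as if it were absent.
    lower = None if stat_min is None or math.isnan(stat_min) else stat_min
    upper = None if stat_max is None or math.isnan(stat_max) else stat_max
    # A +0 min may hide -0 values, so widen a zero lower bound to -0.0.
    if lower is not None and lower == 0.0:
        lower = -0.0
    # A -0 max may hide +0 values, so widen a zero upper bound to +0.0.
    if upper is not None and upper == 0.0:
        upper = 0.0
    return lower, upper


def may_contain(value: float, stat_min: Optional[float], stat_max: Optional[float]) -> bool:
    """Conservatively report whether a row group might hold `value`."""
    if math.isnan(value):
        return True  # min and max carry no information about NaNs
    lower, upper = usable_float_bounds(stat_min, stat_max)
    if lower is not None and value < lower:
        return False
    if upper is not None and value > upper:
        return False
    return True
```

Python's `<` and `>` already treat `-0.0` and `0.0` as equal, so the widening step only changes outcomes for readers that compare bit patterns or use a total order; it is kept here to mirror the rules as written.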