diff --git a/README.md b/README.md index f5478c85..99b05468 100644 --- a/README.md +++ b/README.md @@ -81,7 +81,7 @@ more pages. - Encoding/Compression - Page ## File format -This file and the [thrift definition](src/main/thrift/parquet.thrift) should be read together to understand the format. +This file and the [Thrift definition](src/main/thrift/parquet.thrift) should be read together to understand the format. 4-byte magic number "PAR1" @@ -104,7 +104,7 @@ This file and the [thrift definition](src/main/thrift/parquet.thrift) should be In the above example, there are N columns in this table, split into M row groups. The file metadata contains the locations of all the column metadata start locations. More details on what is contained in the metadata can be found -in the thrift definition. +in the Thrift definition. Metadata is written after the data to allow for single pass writing. @@ -144,6 +144,42 @@ documented in [LogicalTypes.md][logical-types]. [logical-types]: LogicalTypes.md +### Sort Order + +Parquet stores min/max statistics at several levels (such as Column Chunk, +Column Index and Data Page). Comparison for values of a type obey the +following rules: + +1. Each logical type has a specified comparison order. If a column is + annotated with an unknown logical type, statistics may not be used + for pruning data. The sort order for logical types is documented in + the [LogicalTypes.md][logical-types] page. +2. For primitive types, the following rules apply: + + * BOOLEAN - false, true + * INT32, INT64 - Signed comparison. + * FLOAT, DOUBLE - Signed comparison with special handling of NaNs and + signed zeros. The details are documented in the + [Thrift definition](src/main/thrift/parquet.thrift) in the + `ColumnOrder` union. They are summarized here but the Thrift definition + is considered authoritative: + * NaNs should not be written to min or max statistics fields. + * If the computed max value is zero (whether negative or positive), + `+0.0` should be written into the max statistics field. + * If the computed min value is zero (whether negative or positive), + `-0.0` should be written into the min statistics field. + + For backwards compatibility when reading files: + * If the min is a NaN, it should be ignored. + * If the max is a NaN, it should be ignored. + * If the min is +0, the row group may contain -0 values as well. + * If the max is -0, the row group may contain +0 values as well. + * When looking for NaN values, min and max should be ignored. + + * BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY - Lexicographic unsigned byte-wise + comparison. + + ## Nested Encoding To encode nested columns, Parquet uses the Dremel encoding with definition and repetition levels. Definition levels specify how many optional fields in the diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift index 81a7cf82..d602c683 100644 --- a/src/main/thrift/parquet.thrift +++ b/src/main/thrift/parquet.thrift @@ -902,6 +902,13 @@ union ColumnOrder { * - If the min is +0, the row group may contain -0 values as well. * - If the max is -0, the row group may contain +0 values as well. * - When looking for NaN values, min and max should be ignored. + * + * When writing statistics the following rules should be followed: + * - NaNs should not be written to min or max statistics fields. + * - If the computed max value is zero (whether negative or positive), + * `+0.0` should be written into the max statistics field. + * - If the computed min value is zero (whether negative or positive), + * `-0.0` should be written into the min statistics field. */ 1: TypeDefinedOrder TYPE_ORDER; }