From d13f210c767670b62a4091ffc1d439cb5dad5720 Mon Sep 17 00:00:00 2001
From: Micah Kornfield
Date: Sat, 5 Nov 2022 04:29:24 +0000
Subject: [PATCH 1/5] PARQUET-1222: Add details about sort order

This adds details about primitive sort order the specification docs.
See JIRA for discussion.
---
 README.md | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/README.md b/README.md
index f5478c85..87bf90a4 100644
--- a/README.md
+++ b/README.md
@@ -144,6 +144,25 @@ documented in [LogicalTypes.md][logical-types].

 [logical-types]: LogicalTypes.md

+### Sort Order
+
+Parquet stores min/max statistics at several levels (e.g. RowGroup, Page Index,
+etc). Comparison for values of a type follow the following logic:
+
+1. Each logical type has a specified comparison order. If a column is
+   annotated with an unknown logical type, statistics may not be used
+   for pruning data. The sort order for logical types is documented in
+   the [LogicalTypes.md][logical-types] page.
+2. For primitives the following sort orders apply:
+
+  * BOOLEAN - false, true
+  * INT32, INT64, FLOAT, DOUBLE - Signed comparison. Floating point values are
+    not totally ordered due to special case like NaN and infinity. They require special
+    handling when reading statistics. The details are documented in parquet.thrift in the
+    `ColumnOrder` union.
+  * BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY - Lexicographic Unsigned byte-wise comparisons.
+
+
 ## Nested Encoding
 To encode nested columns, Parquet uses the Dremel encoding with definition and
 repetition levels. Definition levels specify how many optional fields in the

From fba6608d836004dec30ff6ce608fb0a84822bd73 Mon Sep 17 00:00:00 2001
From: emkornfield
Date: Mon, 5 Dec 2022 23:04:27 -0800
Subject: [PATCH 2/5] try to address feedback.
---
 README.md                      | 17 +++++++++++++++--
 src/main/thrift/parquet.thrift |  7 +++++++
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 87bf90a4..ee926c6a 100644
--- a/README.md
+++ b/README.md
@@ -157,9 +157,22 @@ etc). Comparison for values of a type follow the following logic:

   * BOOLEAN - false, true
   * INT32, INT64, FLOAT, DOUBLE - Signed comparison. Floating point values are
-    not totally ordered due to special case like NaN and infinity. They require special
+    not totally ordered due to special case like NaN. They require special
     handling when reading statistics. The details are documented in parquet.thrift in the
-    `ColumnOrder` union.
+    `ColumnOrder` union. They are summarized
+    here but parquet.thrift is considered authoritative:
+    * NaNs should not be written to min or max statistics fields.
+    * Only -0 should be written into min statistics fields (if only +0 is present in the column it should be converted to -0.0).
+    * Only +0 should be written into
+      a max statistics fields (if only -0 is present it must be convereted to +0).
+
+    For backwards compatibility when reading files:
+    * If the min is a NaN, it should be ignored.
+    * If the max is a NaN, it should be ignored.
+    * If the min is +0, the row group may contain -0 values as well.
+    * If the max is -0, the row group may contain +0 values as well.
+    * When looking for NaN values, min and max should be ignored.
+
   * BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY - Lexicographic Unsigned byte-wise comparisons.


diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 81a7cf82..c4d9516e 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -902,6 +902,13 @@ union ColumnOrder {
    *   - If the min is +0, the row group may contain -0 values as well.
    *   - If the max is -0, the row group may contain +0 values as well.
    *   - When looking for NaN values, min and max should be ignored.
+   *
+   * When writing statistics the following rules should be followed:
+   *   - NaNs should not be written to min or max statistics fields.
+   *   - Only -0 should be written into min statistics fields (if only
+   *     +0 is present in the column it should be converted to -0.0).
+   *   - Only +0 should be written into a max statistics fields (if
+   *     only -0 is present it must be convereted to +0).
    */
   1: TypeDefinedOrder TYPE_ORDER;
 }

From f33a3c3dbbc2e97b7b465aa4b557c62bab647989 Mon Sep 17 00:00:00 2001
From: emkornfield
Date: Tue, 6 Dec 2022 21:25:41 -0800
Subject: [PATCH 3/5] Apply suggestions from code review

Co-authored-by: Antoine Pitrou
---
 README.md                      | 11 +++++++----
 src/main/thrift/parquet.thrift |  8 ++++----
 2 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/README.md b/README.md
index ee926c6a..c7a52108 100644
--- a/README.md
+++ b/README.md
@@ -146,17 +146,20 @@ documented in [LogicalTypes.md][logical-types].

 ### Sort Order

-Parquet stores min/max statistics at several levels (e.g. RowGroup, Page Index,
-etc). Comparison for values of a type follow the following logic:
+Parquet stores min/max statistics at several levels (such as Column Chunk,
+Column Index and Data Page). Comparison for values of a type obey the
+following rules:

 1. Each logical type has a specified comparison order. If a column is
    annotated with an unknown logical type, statistics may not be used
    for pruning data. The sort order for logical types is documented in
    the [LogicalTypes.md][logical-types] page.
-2. For primitives the following sort orders apply:
+2. For primitive types, the following rules apply:

   * BOOLEAN - false, true
-  * INT32, INT64, FLOAT, DOUBLE - Signed comparison. Floating point values are
+  * INT32, INT64 - Signed comparison.
+  * FLOAT, DOUBLE - Signed comparison with special handling of NaNs
+    and signed zeros. The details are documented in...
     not totally ordered due to special case like NaN. They require special
     handling when reading statistics. The details are documented in parquet.thrift in the
     `ColumnOrder` union. They are summarized
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index c4d9516e..d602c683 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -905,10 +905,10 @@ union ColumnOrder {
    *
    * When writing statistics the following rules should be followed:
    *   - NaNs should not be written to min or max statistics fields.
-   *   - Only -0 should be written into min statistics fields (if only
-   *     +0 is present in the column it should be converted to -0.0).
-   *   - Only +0 should be written into a max statistics fields (if
-   *     only -0 is present it must be convereted to +0).
+   *   - If the computed max value is zero (whether negative or positive),
+   *     `+0.0` should be written into the max statistics field.
+   *   - If the computed min value is zero (whether negative or positive),
+   *     `-0.0` should be written into the min statistics field.
    */
   1: TypeDefinedOrder TYPE_ORDER;
 }

From 4b55e9ccab74af0511c2ddb2846f8b7b3fc4d436 Mon Sep 17 00:00:00 2001
From: emkornfield
Date: Tue, 6 Dec 2022 21:30:16 -0800
Subject: [PATCH 4/5] syncrhonize readme.md

---
 README.md | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/README.md b/README.md
index c7a52108..caa1e2c6 100644
--- a/README.md
+++ b/README.md
@@ -158,16 +158,12 @@ following rules:

   * BOOLEAN - false, true
   * INT32, INT64 - Signed comparison.
-  * FLOAT, DOUBLE - Signed comparison with special handling of NaNs
-    and signed zeros. The details are documented in...
-    not totally ordered due to special case like NaN. They require special
-    handling when reading statistics. The details are documented in parquet.thrift in the
+  * FLOAT, DOUBLE - Signed comparison with special handling of NaNs and signed zeros. The details are documented in parquet.thrift in the
     `ColumnOrder` union. They are summarized
     here but parquet.thrift is considered authoritative:
     * NaNs should not be written to min or max statistics fields.
-    * Only -0 should be written into min statistics fields (if only +0 is present in the column it should be converted to -0.0).
-    * Only +0 should be written into
-      a max statistics fields (if only -0 is present it must be convereted to +0).
+    * If the computed max value is zero (whether negative or positive), `+0.0` should be written into the max statistics field.
+    * If the computed min value is zero (whether negative or positive), `-0.0` should be written into the min statistics field.

     For backwards compatibility when reading files:
     * If the min is a NaN, it should be ignored.

From b38ad853bf9e397796fb1825e64fcede5d8fc843 Mon Sep 17 00:00:00 2001
From: Antoine Pitrou
Date: Wed, 7 Dec 2022 08:54:00 +0100
Subject: [PATCH 5/5] Formatting nits

---
 README.md | 33 +++++++++++++++++++--------------
 1 file changed, 19 insertions(+), 14 deletions(-)

diff --git a/README.md b/README.md
index caa1e2c6..99b05468 100644
--- a/README.md
+++ b/README.md
@@ -81,7 +81,7 @@ more pages.
   - Encoding/Compression - Page

 ## File format
-This file and the [thrift definition](src/main/thrift/parquet.thrift) should be read together to understand the format.
+This file and the [Thrift definition](src/main/thrift/parquet.thrift) should be read together to understand the format.

     4-byte magic number "PAR1"
@@ -104,7 +104,7 @@ This file and the [thrift definition](src/main/thrift/parquet.thrift) should be
 In the above example, there are N columns in this table, split into M row
 groups. The file metadata contains the locations of all the column metadata
 start locations. More details on what is contained in the metadata can be found
-in the thrift definition.
+in the Thrift definition.

 Metadata is written after the data to allow for single pass writing.

@@ -158,21 +158,26 @@ following rules:

   * BOOLEAN - false, true
   * INT32, INT64 - Signed comparison.
-  * FLOAT, DOUBLE - Signed comparison with special handling of NaNs and signed zeros. The details are documented in parquet.thrift in the
-    `ColumnOrder` union. They are summarized
-    here but parquet.thrift is considered authoritative:
-    * NaNs should not be written to min or max statistics fields.
-    * If the computed max value is zero (whether negative or positive), `+0.0` should be written into the max statistics field.
-    * If the computed min value is zero (whether negative or positive), `-0.0` should be written into the min statistics field.
+  * FLOAT, DOUBLE - Signed comparison with special handling of NaNs and
+    signed zeros. The details are documented in the
+    [Thrift definition](src/main/thrift/parquet.thrift) in the
+    `ColumnOrder` union. They are summarized here but the Thrift definition
+    is considered authoritative:
+      * NaNs should not be written to min or max statistics fields.
+      * If the computed max value is zero (whether negative or positive),
+        `+0.0` should be written into the max statistics field.
+      * If the computed min value is zero (whether negative or positive),
+        `-0.0` should be written into the min statistics field.

     For backwards compatibility when reading files:
-    * If the min is a NaN, it should be ignored.
-    * If the max is a NaN, it should be ignored.
-    * If the min is +0, the row group may contain -0 values as well.
-    * If the max is -0, the row group may contain +0 values as well.
-    * When looking for NaN values, min and max should be ignored.
+      * If the min is a NaN, it should be ignored.
+      * If the max is a NaN, it should be ignored.
+      * If the min is +0, the row group may contain -0 values as well.
+      * If the max is -0, the row group may contain +0 values as well.
+      * When looking for NaN values, min and max should be ignored.

-  * BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY - Lexicographic Unsigned byte-wise comparisons.
+  * BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY - Lexicographic unsigned byte-wise
+    comparison.


 ## Nested Encoding
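
Reviewer note (illustrative only, not part of the patch): the FLOAT/DOUBLE rules in the final README text can be sketched in Python as below. The function names `float_min_max_for_stats` and `may_contain` are hypothetical and do not correspond to any Parquet library API. Returning `None` for an all-NaN column is an assumption about how a writer might behave, since the patch only says NaNs should not be written to min or max.

```python
import math
from typing import Iterable, Optional, Tuple


def float_min_max_for_stats(values: Iterable[float]) -> Optional[Tuple[float, float]]:
    """Writer side: compute the (min, max) pair to store, following the writing rules."""
    # NaNs should not be written to min or max, so they are skipped entirely.
    usable = [v for v in values if not math.isnan(v)]
    if not usable:
        return None  # assumption: write no min/max when only NaNs are present
    lo, hi = min(usable), max(usable)
    # IEEE comparison treats -0.0 and +0.0 as equal, so `== 0.0` matches both.
    if lo == 0.0:
        lo = -0.0  # a zero min is always written as -0.0
    if hi == 0.0:
        hi = 0.0   # a zero max is always written as +0.0
    return lo, hi


def may_contain(target: float, stat_min: Optional[float], stat_max: Optional[float]) -> bool:
    """Reader side: return False only if the statistics prove `target` is absent."""
    if math.isnan(target):
        return True  # when looking for NaN values, min and max are ignored
    # A NaN bound (possible in older files) is ignored rather than trusted.
    if stat_min is not None and not math.isnan(stat_min) and target < stat_min:
        return False
    if stat_max is not None and not math.isnan(stat_max) and target > stat_max:
        return False
    return True
```

Because the comparisons above are ordinary IEEE comparisons, a stored min of `+0.0` does not exclude a search for `-0.0` (and likewise for a max of `-0.0`), which matches the backwards-compatibility rules. A reader that uses a total-order comparison instead would need to handle signed zeros explicitly.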