
Spec: Support geo type #10981

Open

wants to merge 9 commits into main from geo_spec_draft
Conversation

Collaborator

@szehon-ho szehon-ho commented Aug 20, 2024

This is the spec change for #10260.

This is also based closely on the decisions taken in the Parquet proposal for the same type: apache/parquet-format#240

@github-actions github-actions bot added the Specification Issues that may introduce spec changes. label Aug 20, 2024
@szehon-ho szehon-ho force-pushed the geo_spec_draft branch 5 times, most recently from a096921 to 19f24a4 on August 20, 2024 20:47
format/spec.md Outdated
XZ2 is based on the paper [XZ-Ordering: A Space-Filling Curve for Objects with Spatial Extensions].

Notes:
1. Resolution must be a positive integer. Defaults to TODO
Collaborator Author

@jiayuasu do you have any suggestion for default here?

Member

12 sounds fine. CC @Kontinuation

Member

GeoMesa uses a high XZ2 resolution when working with key-value stores such as Accumulo and HBase, but it is not appropriate to always use a resolution that high for partitioning data (for instance, GeoMesa on FileSystems).

XZ2 resolution 11~12 works for city-scale data, but will generate too many partitions for country-scale or world-scale data. I'd like to have a smaller default value such as 7 to be safe on various kinds of data.

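A rough back-of-the-envelope sketch of why this matters, assuming the sequence-code count from the XZ-ordering paper (the sum of 4^l for levels l = 0..g): resolution 12 allows roughly a thousand times more distinct partition values than resolution 7.

```java
// Rough sketch: upper bound on distinct XZ2 partition values at a given
// maximum resolution g, assuming the code space is the sum of 4^l for l = 0..g.
public class Xz2CodeSpace {
  static long codeSpace(int resolution) {
    long total = 0;
    long cellsAtLevel = 1; // 4^0
    for (int level = 0; level <= resolution; level++) {
      total += cellsAtLevel;
      cellsAtLevel *= 4;
    }
    return total;
  }

  public static void main(String[] args) {
    System.out.println(codeSpace(7));  // prints 21845
    System.out.println(codeSpace(12)); // prints 22369621
  }
}
```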

format/spec.md Outdated
| **`struct`** | `group` | | |
| **`list`** | `3-level list` | `LIST` | See Parquet docs for 3-level representation. |
| **`map`** | `3-level map` | `MAP` | See Parquet docs for 3-level representation. |
| **`geometry`** | `binary` | `GEOMETRY` | WKB format, see Appendix G. Logical type annotation optional for supported Parquet format versions [1]. |
Collaborator Author

I could add this section later too, once it's implemented (same for ORC below)

Member

[Appendix G](#appendix-g)

format/spec.md Outdated
| _optional_ | _optional_ | **`110 null_value_counts`** | `map<121: int, 122: long>` | Map from column id to number of null values in the column |
| _optional_ | _optional_ | **`137 nan_value_counts`** | `map<138: int, 139: long>` | Map from column id to number of NaN values in the column |
| _optional_ | _optional_ | **`111 distinct_counts`** | `map<123: int, 124: long>` | Map from column id to number of distinct values in the column; distinct counts must be derived using values in the file by counting or using sketches, but not using methods like merging existing distinct counts |
| _optional_ | _optional_ | **`125 lower_bounds`** | `map<126: int, 127: binary>` | Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all non-null, non-NaN values in the column for the file [2] For Geometry type, this is a Point composed of the min value of each dimension in all Points in the Geometry. |
Contributor

How does this work? Does Iceberg need to interpret each WKB to produce this value? Will it be provided by Parquet?

Collaborator Author
@szehon-ho szehon-ho Aug 20, 2024

Yes, once we switch to Geometry logical type from Parquet we will get these stats from Parquet.

Contributor

Should we mention that it is the parquet type BoundingBox?

Collaborator Author

yea will add a footnote here

Member
@jiayuasu jiayuasu Aug 21, 2024

@szehon-ho BTW, the reason why we had separate bbox statistics in havasu is to be compatible with existing Iceberg tables. Since this adds native geometry support, lower_bound and upper_bound are good choices.

Member
@Kontinuation Kontinuation Aug 21, 2024

I thought that the bounds were stored as WKB-encoded points (according to Appendix D and G), and WKB encodes dimensions of geometries in the header. It is more consistent to make bound values the same type/representation as the field data type.

More sophisticated coverings in Parquet statistics cannot be easily mapped to lower_bounds and upper_bounds, so do we simply use the bbox statistics and ignore the coverings for now?


I think it is ok since these two bounds are optional; if they are not present, it still follows the spec.

Member
@jiayuasu jiayuasu Aug 21, 2024

Does it support different dimensions like XY, XYZ, XYM, XYZM? If yes, how can we tell if the binary is for XYZ or XYM?

We should say "For Geometry type, this is a WKB-encoded Point composed of the min value of each dimension in all Points in the Geometry." Then we don't have to worry about the Z and M values.

CC @szehon-ho
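A minimal sketch of how a writer could derive such bound points for XY data, assuming the JTS library (mentioned later in this thread for the Java reference implementation); the class and method names are illustrative only:

```java
import org.locationtech.jts.geom.Coordinate;
import org.locationtech.jts.geom.Envelope;
import org.locationtech.jts.geom.Geometry;
import org.locationtech.jts.geom.GeometryFactory;
import org.locationtech.jts.io.WKBWriter;

// Illustrative only: accumulate an envelope over every geometry value in the
// file, then serialize the min/max corners as WKB-encoded points.
class GeometryColumnBounds {
  private final Envelope envelope = new Envelope();
  private final GeometryFactory factory = new GeometryFactory();
  private final WKBWriter wkbWriter = new WKBWriter();

  void update(Geometry value) { // call once per non-null value
    envelope.expandToInclude(value.getEnvelopeInternal());
  }

  byte[] lowerBound() { // min of each dimension, as a WKB-encoded point (assumes at least one value seen)
    return wkbWriter.write(
        factory.createPoint(new Coordinate(envelope.getMinX(), envelope.getMinY())));
  }

  byte[] upperBound() { // max of each dimension, as a WKB-encoded point
    return wkbWriter.write(
        factory.createPoint(new Coordinate(envelope.getMaxX(), envelope.getMaxY())));
  }
}
```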

Collaborator Author
@szehon-ho szehon-ho Sep 11, 2024

We should say For Geometry type, this is a WKB-encoded Point composed of the min value of each dimension in all Points in the Geometry. Then we don't have to worry about the Z and M value.

@jiayuasu @Kontinuation @wgtmac Done, thanks.

Collaborator Author

@flyrain

Should we mention that it is the parquet type BoundingBox?

Actually, looking again after some time, I'm not sure how to mention this here, as that is file-type specific. This is an optional field, and only set if the type is Parquet and bounding_box is set, but that's an implementation detail.

format/spec.md Outdated (resolved)
format/spec.md Outdated
| **`void`** | Always produces `null` | Any | Source type or `int` |
| Transform name | Description | Source types | Result type |
|-------------------|--------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------|----------------------|
| **`identity`** | Source value, unmodified | Any | Source type |
Contributor

Except for geometry?

Collaborator Author

@rdblue @flyrain hmm, is there a specific reason? I think it could work technically, as Geometry can be compared, unless I'm missing something

Contributor

Maybe that's fine if it is comparable, but practically people will always use xz2, right? I'm not sure, but I'm wondering if there are implications, e.g., too expensive, or super high cardinality, such that we wouldn't recommend users use the original GEO value in the partition spec.

Collaborator Author

Yea, I think it's possible to do (it's just the WKB value after all); you are right, I'm not sure there is a good use case. We have to get the WKB in any case; I am not sure if it's that expensive, but I can check. But I guess the cardinality is the same consideration as for any other type (uuid for example), and we let the user choose?

format/spec.md Outdated
| Transform name | Description | Source types | Result type |
|-------------------|--------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------|----------------------|
| **`identity`** | Source value, unmodified | Any | Source type |
| **`bucket[N]`** | Hash of value, mod `N` (see below) | `int`, `long`, `decimal`, `date`, `time`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns`, `string`, `uuid`, `fixed`, `binary` | `int` |
Contributor

Are we going to support bucketing on GEO?

Collaborator Author
@szehon-ho szehon-ho Aug 21, 2024

I think it's possible, but again I'm not sure of the utility. Geo boils down to just WKB bytes.

Contributor

I feel that the argument for identity can apply here as well. In that case, we can support it, but it's the user's call to use it or not.

Contributor

We may want to change this to be like identity, using Any except [...].

I would not include geo as a source column for bucketing because there is not a clear definition of equality for geo. The hash would depend on the structure of the object and weird things happen when two objects are "equal" (for some definition) but have different hash values.
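A small sketch of that equality concern, assuming JTS; the byte-order choice here stands in for the many ways structurally different WKB can encode an "equal" geometry:

```java
import java.util.Arrays;
import org.locationtech.jts.geom.Coordinate;
import org.locationtech.jts.geom.GeometryFactory;
import org.locationtech.jts.geom.Point;
import org.locationtech.jts.io.ByteOrderValues;
import org.locationtech.jts.io.WKBWriter;

// The same point written as WKB with different byte orders yields different
// bytes, so hashing raw WKB would put "equal" geometries in different buckets.
public class GeometryBucketingPitfall {
  public static void main(String[] args) {
    Point p = new GeometryFactory().createPoint(new Coordinate(1.0, 1.0));
    byte[] bigEndian = new WKBWriter(2, ByteOrderValues.BIG_ENDIAN).write(p);
    byte[] littleEndian = new WKBWriter(2, ByteOrderValues.LITTLE_ENDIAN).write(p);
    System.out.println(Arrays.equals(bigEndian, littleEndian)); // prints false
  }
}
```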

format/spec.md Outdated
@@ -198,6 +199,9 @@ Notes:
- Timestamp values _with time zone_ represent a point in time: values are stored as UTC and do not retain a source time zone (`2017-11-16 17:10:34 PST` is stored/retrieved as `2017-11-17 01:10:34 UTC` and these values are considered identical).
- Timestamp values _without time zone_ represent a date and time of day regardless of zone: the time value is independent of zone adjustments (`2017-11-16 17:10:34` is always retrieved as `2017-11-16 17:10:34`).
3. Character strings must be stored as UTF-8 encoded byte arrays.
4. Coordinate Reference System, i.e. mapping of how coordinates refer to precise locations on earth. Defaults to "OGC:CRS84". Fixed and cannot be changed by schema evolution.
Member

  1. When we say OGC:CRS84, the value you put in this field should be the following PROJJSON string (see GeoParquet spec)
{
    "$schema": "https://proj.org/schemas/v0.5/projjson.schema.json",
    "type": "GeographicCRS",
    "name": "WGS 84 longitude-latitude",
    "datum": {
        "type": "GeodeticReferenceFrame",
        "name": "World Geodetic System 1984",
        "ellipsoid": {
            "name": "WGS 84",
            "semi_major_axis": 6378137,
            "inverse_flattening": 298.257223563
        }
    },
    "coordinate_system": {
        "subtype": "ellipsoidal",
        "axis": [
        {
            "name": "Geodetic longitude",
            "abbreviation": "Lon",
            "direction": "east",
            "unit": "degree"
        },
        {
            "name": "Geodetic latitude",
            "abbreviation": "Lat",
            "direction": "north",
            "unit": "degree"
        }
        ]
    },
    "id": {
        "authority": "OGC",
        "code": "CRS84"
    }
}
  2. Both the crs and crs_kind fields are optional, but when the crs field is present, the crs_kind field must also be present. In this case, since we hard-code this crs field in this phase, we need to set the crs_kind field (string) to PROJJSON.

Member

Should we include this example in the parquet spec as well?

Member

BTW, should we advise accepted forms or values for CRS and Edges?

Member
@jiayuasu jiayuasu Aug 21, 2024

@wgtmac We should add this value to the Parquet spec for sure. CC @zhangfengcdt

@szehon-ho There is another situation mentioned in the GeoParquet spec: if the CRS field is present but its value is null, it means the data is in an unknown CRS. This happens sometimes because the writer somehow cannot find or loses the CRS info. Do we want to support this? I think we can use the empty string to cover this case.

Member

accepted forms or values for CRS and Edges

If we borrow the conclusion from the Parquet Geometry proposal, then the C, T, E fields are as follows:

  • C is a string. Based on what I understand from this PR, @szehon-ho made this field a required field, which is fine.
  • T is optional and a string. Currently, it only allows the value PROJJSON. When it is not provided, it defaults to PROJJSON too.
  • E is a string. The only allowed value is PLANAR in this phase. Based on what I understand from this PR, @szehon-ho made this field a required field, which is fine. @szehon-ho According to our meeting with Snowflake, I think maybe we can allow SPHERICAL too? We can add in the spec that it is currently unsafe to perform partition transforms / bounding box filtering when E = SPHERICAL, because they are built based on PLANAR edges. It is the reader's responsibility to decide if they want to use partition transforms / bounding box filtering.


BTW, the parquet-format uses crs_encoding instead of crs_kind. Do we want to unify the names as well?

Member

What is the expectation for the C, T and E fields in the Parquet/ORC data files? Are they required to be set by Iceberg? In the Parquet spec, only E is required; both C and T are optional.

Collaborator Author
@szehon-ho szehon-ho Sep 11, 2024

Changed from type to encoding.

Actually, I made all three have default values (C="OGC:CRS84", CE="PROJJSON", E="planar"); does that make sense? I'm trying to make the common case less verbose. @wgtmac @jiayuasu

Member
@jiayuasu jiayuasu Sep 13, 2024

@szehon-ho

  1. Does it make sense to include the following CRS84 example from the Parquet Geometry PR?
  /**
   * Coordinate Reference System, i.e. mapping of how coordinates refer to
   * precise locations on earth. Writers are not required to set this field.
   * Once crs is set, crs_encoding field below MUST be set together.
   * For example, "OGC:CRS84" can be set in the form of PROJJSON as below:
   * {
   *     "$schema": "https://proj.org/schemas/v0.5/projjson.schema.json",
   *     "type": "GeographicCRS",
   *     "name": "WGS 84 longitude-latitude",
   *     "datum": {
   *         "type": "GeodeticReferenceFrame",
   *         "name": "World Geodetic System 1984",
   *         "ellipsoid": {
   *             "name": "WGS 84",
   *             "semi_major_axis": 6378137,
   *             "inverse_flattening": 298.257223563
   *         }
   *     },
   *     "coordinate_system": {
   *         "subtype": "ellipsoidal",
   *         "axis": [
   *         {
   *             "name": "Geodetic longitude",
   *             "abbreviation": "Lon",
   *             "direction": "east",
   *             "unit": "degree"
   *         },
   *         {
   *             "name": "Geodetic latitude",
   *             "abbreviation": "Lat",
   *             "direction": "north",
   *             "unit": "degree"
   *         }
   *         ]
   *     },
   *     "id": {
   *         "authority": "OGC",
   *         "code": "CRS84"
   *     }
   * }
   */
  2. It is ok to have them all fixed to default values for this phase.

Collaborator Author

@jiayuasu I put it in the example (if you render the page). Let me know if it's not what you meant.

format/spec.md Outdated
@@ -190,6 +190,7 @@ Supported primitive types are defined in the table below. Primitive types added
| | **`uuid`** | Universally unique identifiers | Should use 16-byte fixed |
| | **`fixed(L)`** | Fixed-length byte array of length L | |
| | **`binary`** | Arbitrary-length byte array | |
| [v3](#version-3) | **`geometry(C, T, E)`** | An object of the simple feature geometry model as defined by Appendix G; This may be any of the Geometry subclasses defined therein; coordinate reference system C [4], coordinate reference system type T [5], edges E [6] | C, T, E are fixed. Encoded as WKB, see Appendix G. |
Member

What syntax to use for an engine to create the geometry type? Does it require C/T/E to appear in the type?

Collaborator Author

Related to above comment, I think these will all be optional (take a default value if not specified).

format/spec.md Outdated (resolved)
format/spec.md Outdated (resolved)
format/spec.md Outdated (resolved)
@szehon-ho
Collaborator Author

szehon-ho commented Aug 24, 2024

Hi all, FYI I have unfortunately encountered some problems while remote and probably can't update this; I will come back to it after I get back home in two weeks.

@szehon-ho szehon-ho force-pushed the geo_spec_draft branch 2 times, most recently from 75326dc to 0591f68 on September 11, 2024 21:37
@szehon-ho
Collaborator Author

@jiayuasu @Kontinuation @wgtmac @flyrain @rdblue sorry for the delay; I only got access now. Updated the PR.

format/spec.md Outdated

Notes:

1. Timestamp values _without time zone_ represent a date and time of day regardless of zone: the time value is independent of zone adjustments (`2017-11-16 17:10:34` is always retrieved as `2017-11-16 17:10:34`).
2. Timestamp values _with time zone_ represent a point in time: values are stored as UTC and do not retain a source time zone (`2017-11-16 17:10:34 PST` is stored/retrieved as `2017-11-17 01:10:34 UTC` and these values are considered identical).
3. Character strings must be stored as UTF-8 encoded byte arrays.
4. Crs (coordinate reference system) is a mapping of how coordinates refer to precise locations on earth. Defaults to "OGC:CRS84". Fixed and cannot be changed by schema evolution.
5. Crs-encoding (coordinate reference system encoding) is the type of crs field. Must be set if crs is set. Defaults to "PROJJSON". Fixed and cannot be changed by schema evolution.
6. Edges is the interpretation for non-point geometries in geometry object, i.e. whether an edge between points represent a straight cartesian line or the shortest line on the sphere. Defaults to "planar". Fixed and cannot be changed by schema evolution.


Can we maybe explicitly mention here that both "planar" and "spherical" are supported as edge type enum values?

Collaborator Author
@szehon-ho szehon-ho Sep 18, 2024

@dmitrykoval I was debating this.

I guess we talked about it before, but in the Java reference implementation we cannot easily do pruning (file level, row level, or partition level) because the JTS library and XZ2 only support non-spherical edges. We would need new metrics types, new Java libraries, and new partition transform proposals if we wanted to support it in the Java reference implementation.

But if we want to support it, I'm ok to list it here and have checks to just skip pruning for spherical geometry columns.
@flyrain @jiayuasu @Kontinuation does that make sense?

@dmitrykoval dmitrykoval Sep 18, 2024

I see. I think if "planar" is the default edge type, then there shouldn't be many changes to the planar geometry code path, except for additional checks to skip some partitioning/pruning cases, right?

Regarding the reference implementation of the "spherical" type, do we need to fully support it from day one, or can we maybe mark it as optional in the initial version of the spec? For example, it would work if the engine supports it, but by default, we would fall back to the planar edge type?

Member

We could list spherical as an allowed edge type here. Maybe just mark it that it is not safe to perform partition transform or lower_bound/upper_bound filtering when the edge is spherical. We did the same in the Parquet Geometry PR.

Collaborator Author

Yea, I forgot to mention explicitly that in Iceberg, pruning is always an optional feature for reads, so there is no issue.

format/spec.md Outdated
| _optional_ | _optional_ | **`110 null_value_counts`** | `map<121: int, 122: long>` | Map from column id to number of null values in the column |
| _optional_ | _optional_ | **`137 nan_value_counts`** | `map<138: int, 139: long>` | Map from column id to number of NaN values in the column |
| _optional_ | _optional_ | **`111 distinct_counts`** | `map<123: int, 124: long>` | Map from column id to number of distinct values in the column; distinct counts must be derived using values in the file by counting or using sketches, but not using methods like merging existing distinct counts |
| _optional_ | _optional_ | **`125 lower_bounds`** | `map<126: int, 127: binary>` | Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all non-null, non-NaN values in the column for the file [2] For geometry type, this is a WKB-encoded point composed of the min value of each dimension among all component points of all geometry objects for the file. |
Member

For geometry type, this is a WKB-encoded point composed of the min value of each dimension among all component points of all geometry objects for the file.

As we are finishing the PoC on the Parquet side, the remaining issue is what value to write to min_value/max_value fields of statistics and page index. To give some context, Parquet requires min_value/max_value fields to be set for page index and statistics are used to generate page index. The C++ PoC is omitting min_value/max_value values and the Java PoC is pretending geometry values are plain binary values while collecting the stats. Should we do similar things here? Then the Iceberg code can directly consume min_value/max_value from statistics instead of issuing another call to get the specialized GeometryStatistics which is designed for advanced purpose.

@Kontinuation @zhangfengcdt @paleolimbot @jiayuasu

Member

@wgtmac do you mean that Iceberg uses the Parquet Geometry GeometryStatistics or Parquet Geometry uses the min_value/max_value idea from Iceberg?

Member

I mean the latter. The ColumnOrder of the new geometry type is undefined as specified at https://github.com/apache/parquet-format/pull/240/files#diff-834c5a8d91719350b20995ad99d1cb6d8d68332b9ac35694f40e375bdb2d3e7cR1144. It means that the min_value/max_value fields are meaningless and should not be used. I'm not sure if it is a good idea to set min_value/max_value fields in the same way as lower_bounds/upper_bounds of Iceberg.

Member

I suggest defining the sort order of geometry columns as WKB-encoded points in the parquet format spec. This is the most simple yet useful way of defining the min and max bounds for geometry columns, and the sort order is better to be well-defined rather than left undefined.

Member

I agree that it is better to explicitly define the column order than being undefined. If we go with this approach, the format PR and two PoC impls need to reflect this change, which might get more complicated.

Collaborator Author
@szehon-ho szehon-ho Sep 19, 2024

Is there anything for this specific line we need to change? As long as we can get the values from Parquet in some way we are ok here, but is the format of the lower/upper bound still ok?

Member

No, I was thinking about whether Parquet could do better by doing similar things in the future.

format/spec.md Outdated
@@ -200,12 +200,16 @@ Supported primitive types are defined in the table below. Primitive types added
| | **`uuid`** | Universally unique identifiers | Should use 16-byte fixed |
| | **`fixed(L)`** | Fixed-length byte array of length L | |
| | **`binary`** | Arbitrary-length byte array | |
| [v3](#version-3) | **`geometry(C, CE, E)`** | An object of the simple feature geometry model as defined by Appendix G; This may be any of the geometry subclasses defined therein; crs C [4], crs-encoding CE [5], edges E [6] | C, CE, E are fixed, and if unset will take default values. |
Member
@RussellSpitzer RussellSpitzer Sep 19, 2024

I think maybe we should just link out for the requirements here since it's a bit complicated.

The description as well could be

Simple feature geometry Appendix G, Parameterized by ....

I also don't think we should allow it to be unset ... can we just require that a subclass is always picked? We could recommend a set of defaults for engines to set on field creation but I'm not sure we need to be that opinionated here.

format/spec.md Outdated
@@ -1312,7 +1335,7 @@ This serialization scheme is for storing single values as individual binary valu
| **`struct`** | **`JSON object by field ID`** | `{"1": 1, "2": "bar"}` | Stores struct fields using the field ID as the JSON field name; field values are stored using this JSON single-value format |
| **`list`** | **`JSON array of values`** | `[1, 2, 3]` | Stores a JSON array of values that are serialized using this JSON single-value format |
| **`map`** | **`JSON object of key and value arrays`** | `{ "keys": ["a", "b"], "values": [1, 2] }` | Stores arrays of keys and values; individual keys and values are serialized using this JSON single-value format |

| **`geometry`** | **`JSON string`** |`00000000013FF00000000000003FF0000000000000`| Stores WKB as a hexadecimal string, see Appendix G |
Member

link again

format/spec.md Outdated
| **`struct`** | Not supported |
| **`list`** | Not supported |
| **`map`** | Not supported |
| **`geometry`** | WKB format, see Appendix G |
Member

link

format/spec.md Outdated
| **`timestamp_ns`** | Stores nanoseconds from 1970-01-01 00:00:00.000000000 in an 8-byte little-endian long |
| **`timestamptz_ns`** | Stores nanoseconds from 1970-01-01 00:00:00.000000000 UTC in an 8-byte little-endian long |
| **`string`** | UTF-8 bytes (without length) |
| **`uuid`** | 16-byte big-endian value, see example in Appendix B |
Member

might as well fix this one too while we are at it :)

format/spec.md Outdated
@@ -200,12 +200,15 @@ Supported primitive types are defined in the table below. Primitive types added
| | **`uuid`** | Universally unique identifiers | Should use 16-byte fixed |
| | **`fixed(L)`** | Fixed-length byte array of length L | |
| | **`binary`** | Arbitrary-length byte array | |
| [v3](#version-3) | **`geometry(C, E)`** | An object of the simple feature geometry model as defined by Appendix G; This may be any of the geometry subclasses defined therein; crs C [4], edges E [5] | C and E are fixed, and if unset will take default values. |
Member

I think maybe we should just link out for the requirements here since it's a bit complicated. Remove "an object"

The description as well could be

Simple feature geometry Appendix G, Parameterized by ....

I also don't think we should allow it to be unset ... can we just require that a subclass is always picked? We could recommend a set of defaults for engines to set on field creation but I'm not sure we need to be that opinionated here.

Contributor

I think we can be more specific here and call out the standard that we are referencing, like we do with IEEE 754. This is "Geometry features as WKB(link) stored in coordinate reference system C and edge type E (see Appendix G)"

I would also say that "If not specified, C is OGC:CRS84 and E is planar".

format/spec.md Outdated
| _optional_ | _optional_ | **`110 null_value_counts`** | `map<121: int, 122: long>` | Map from column id to number of null values in the column |
| _optional_ | _optional_ | **`137 nan_value_counts`** | `map<138: int, 139: long>` | Map from column id to number of NaN values in the column |
| _optional_ | _optional_ | **`111 distinct_counts`** | `map<123: int, 124: long>` | Map from column id to number of distinct values in the column; distinct counts must be derived using values in the file by counting or using sketches, but not using methods like merging existing distinct counts |
| _optional_ | _optional_ | **`125 lower_bounds`** | `map<126: int, 127: binary>` | Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all non-null, non-NaN values in the column for the file [2] For geometry type, this is a WKB-encoded point composed of the min value of each dimension among all component points of all geometry objects for the file, and can be used for basic pruning only on geometry columns with planar edges. |
Member

Let's move these details out of the description and either into the footnotes or another section for geometry.

format/spec.md Outdated
| | _optional_ | **`135 equality_ids`** | `list<136: int>` | Field ids used to determine row equality in equality delete files. Required when `content=2` and should be null otherwise. Fields with ids listed in this column must be present in the delete file |
| _optional_ | _optional_ | **`140 sort_order_id`** | `int` | ID representing sort order for this file [3]. |
| v1 | v2 | Field id, name | Type | Description |
| ---------- | ---------- |-----------------------------------|------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
Contributor

Could you remove the reformatting so we can more easily look at the changes?

format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated
@@ -1084,14 +1100,16 @@ The 32-bit hash implementation is 32-bit Murmur3 hash, x86 variant, seeded with
| **`uuid`** | `hashBytes(uuidBytes(v))` [4] | `f79c3e09-677c-4bbd-a479-3f349cb785e7` → `1488055340` |
| **`fixed(L)`** | `hashBytes(v)` | `00 01 02 03` → `-188683207` |
| **`binary`** | `hashBytes(v)` | `00 01 02 03` → `-188683207` |
| **`geometry`** | `hashBytes(wkb(v))` [5] | `(1.0, 1.0)` → `-246548298` |
Contributor

I would probably not specify how to hash geometry because we don't yet know how to do it correctly. The reason why we have the second table (hash requirements that are not part of bucket) is that we don't want anyone to forget that float and double should hash to the same value.

Contributor

@szehon-ho, I don't think we should specify this or allow geometry in bucket transforms because of issues with equality.

format/spec.md Outdated (resolved)
@szehon-ho
Collaborator Author

szehon-ho commented Sep 22, 2024

Thanks @rdblue @RussellSpitzer, addressed review comments:

  • remove auto formatting
  • added links to Appendix G
  • optimized storage of lower_bounds and upper_bounds (thanks @rdblue for suggestion to skip WKB here)
  • remove XZ2 partition transform (which we can add later, to unblock adding the type for V3)

format/spec.md Outdated
@@ -1312,7 +1325,7 @@ This serialization scheme is for storing single values as individual binary valu
| **`struct`** | **`JSON object by field ID`** | `{"1": 1, "2": "bar"}` | Stores struct fields using the field ID as the JSON field name; field values are stored using this JSON single-value format |
| **`list`** | **`JSON array of values`** | `[1, 2, 3]` | Stores a JSON array of values that are serialized using this JSON single-value format |
| **`map`** | **`JSON object of key and value arrays`** | `{ "keys": ["a", "b"], "values": [1, 2] }` | Stores arrays of keys and values; individual keys and values are serialized using this JSON single-value format |

| **`geometry`** | **`JSON string`** |`00000000013FF00000000000003FF0000000000000`| Stores WKB as a hexadecimal string, see [Appendix G](#appendix-g-geospatial-notes) |
Collaborator Author

@rdblue I am not entirely sure where this part of the spec is used. Should it also match the above (the more optimized serialization for stats)?

Contributor

This is used for default values and for encoding values in JSON expressions for filtering.

What about using WKT here instead of WKB?

Collaborator Author

Good idea, added.
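A minimal sketch of the resulting WKT round trip for a JSON single value, assuming JTS; class and method names are illustrative only:

```java
import org.locationtech.jts.geom.Geometry;
import org.locationtech.jts.io.ParseException;
import org.locationtech.jts.io.WKTReader;
import org.locationtech.jts.io.WKTWriter;

// Illustrative round trip of a geometry single value through WKT, e.g. the
// JSON string "POINT (1 1)" instead of a hex-encoded WKB blob.
class GeometryJsonSingleValue {
  static String toJsonValue(Geometry geometry) {
    return new WKTWriter().write(geometry); // e.g. "POINT (1 1)"
  }

  static Geometry fromJsonValue(String wkt) throws ParseException {
    return new WKTReader().read(wkt);
  }
}
```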

@nastra nastra added this to the Iceberg V3 Spec milestone Sep 25, 2024
@@ -483,6 +485,7 @@ Notes:
2. For `float` and `double`, the value `-0.0` must precede `+0.0`, as in the IEEE 754 `totalOrder` predicate. NaNs are not permitted as lower or upper bounds.
3. If sort order ID is missing or unknown, then the order is assumed to be unsorted. Only data files and equality delete files should be written with a non-null order id. [Position deletes](#position-delete-files) are required to be sorted by file and position, not a table order, and should set sort order id to null. Readers must ignore sort order id for position delete files.
4. The following field ids are reserved on `data_file`: 141.
5. For `geometry`, this is a point composed of the min (lower_bound) or max (upper_bound) value of each dimension among all component points of all geometry objects for the file. These can be used for basic pruning only on geometry columns with planar edges. See Appendix D for encoding.
Contributor

I think we need to be a little more specific here. The way I read this is that you can take min and max values for each dimension in the point, but that isn't sufficient for spherical edges.

I think this needs to state that the lower and upper bounds must be less than or equal (or greater than or equal) to the values of any point that is located on an edge of the geometry object. In other words, the bounding box must contain all points that are in the geometry object.

If we don't have that requirement, then there could be a point that is outside of the bounding box. If that's the case, then a query that includes the point may not overlap the bounding box and we cannot use it for filtering.

Collaborator Author

Per a previous conversation, it could be beneficial to have it in this form even for spherical edges. An engine could do some conversion from the bounding box to make it useful for spherical edges.

Do you mean you do not want this option at all? (I suppose due to the risk of misinterpreting it.)
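For concreteness, a hedged sketch (assuming JTS, planar edges, and illustrative names) of the kind of basic pruning these bounds allow: a file can only be skipped when the query window is disjoint from the box spanned by the lower/upper bound points, which is exactly why the box must contain every point of every geometry object.

```java
import org.locationtech.jts.geom.Envelope;
import org.locationtech.jts.geom.Point;
import org.locationtech.jts.io.ParseException;
import org.locationtech.jts.io.WKBReader;

// Reader-side pruning sketch for planar edges: keep the file unless the query
// window is disjoint from the bounding box spanned by the bound points.
class GeometryPruning {
  static boolean mayContainMatches(byte[] lowerWkb, byte[] upperWkb, Envelope queryWindow)
      throws ParseException {
    if (lowerWkb == null || upperWkb == null) {
      return true; // bounds are optional; without them we cannot prune
    }
    WKBReader reader = new WKBReader();
    Point lower = (Point) reader.read(lowerWkb);
    Point upper = (Point) reader.read(upperWkb);
    Envelope fileBounds = new Envelope(lower.getX(), upper.getX(), lower.getY(), upper.getY());
    return fileBounds.intersects(queryWindow);
  }
}
```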

format/spec.md Outdated

1. [https://github.com/apache/parquet-format/pull/240](https://github.com/apache/parquet-format/pull/240))
Contributor

I'd prefer not to reference a PR.

Collaborator Author

Removed these; I can add the logical types once the PR is merged in Parquet.

format/spec.md Outdated
@@ -1286,6 +1298,7 @@ This serialization scheme is for storing single values as individual binary valu
| **`struct`** | Not supported |
| **`list`** | Not supported |
| **`map`** | Not supported |
| **`geometry`** | Always a single point, it is encoded in big-endian fashion as concatenated {x, y, optional z, optional m} values. |
Contributor

Why big endian? Most of the time we use little endian in the format, with the only exception being the encoded decimal values.

Contributor

Also, what is the encoding for these values? 8-byte IEEE 754?

Collaborator Author

Makes sense, added little-endian and 8-byte IEEE 754 (i.e. double type) for each coordinate.
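A minimal sketch of that encoding for an XY bound point (illustrative names; XY only, assuming the optional z/m components would simply be appended the same way when present):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Minimal sketch of the single-value binary form described above for an XY
// bound point: each coordinate as an 8-byte IEEE 754 double, little-endian.
class BoundPointEncoding {
  static byte[] encodeXY(double x, double y) {
    return ByteBuffer.allocate(16)
        .order(ByteOrder.LITTLE_ENDIAN)
        .putDouble(x)
        .putDouble(y)
        .array();
  }

  static double[] decodeXY(byte[] bytes) {
    ByteBuffer buffer = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
    return new double[] { buffer.getDouble(), buffer.getDouble() };
  }
}
```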

@szehon-ho
Collaborator Author

@rdblue thanks for the further review. I would appreciate a clarification on this comment #10981 (comment); otherwise everything else is addressed.

@@ -1102,6 +1105,7 @@ Hash results are not dependent on decimal scale, which is part of the type, not
4. UUIDs are encoded using big endian. The test UUID for the example above is: `f79c3e09-677c-4bbd-a479-3f349cb785e7`. This UUID encoded as a byte array is:
`F7 9C 3E 09 67 7C 4B BD A4 79 3F 34 9C B7 85 E7`
5. `doubleToLongBits` must give the IEEE 754 compliant bit representation of the double value. All `NaN` bit patterns must be canonicalized to `0x7ff8000000000000L`. Negative zero (`-0.0`) must be canonicalized to positive zero (`0.0`). Float hash values are the result of hashing the float cast to double to ensure that schema evolution does not change hash values if float types are promoted.
6. WKB format, see [Appendix G](#appendix-g-geospatial-notes)
Member

Missing hash specification for geometry primitive type in the table above. We should add a new row for geometry and annotate it with [6].

@@ -1286,6 +1291,7 @@ This serialization scheme is for storing single values as individual binary valu
| **`struct`** | Not supported |
| **`list`** | Not supported |
| **`map`** | Not supported |
| **`geometry`** | A single point, encoded as a {x, y, optional z, optional m} concatenation of its 8-byte IEEE 754 values, little-endian. |
Member
@Kontinuation Kontinuation Oct 9, 2024

Is it always a concatenation of 4 floating-point values? If not, we'll have a hard time figuring out whether the point is XYZ or XYM when there are 3 encoded dimensions. I suggest we use the WKB encoding of points here as well.

Enforcing the appearance of all 4 components and allowing NaN to be filled in for the optional components also works, as it is more similar to the BoundingBox struct defined in the Parquet spec.

Member

+1. This is why we introduced separate bounding box stats in the Parquet proposal, to avoid this issue.

Labels: spark, Specification