
PARQUET-2471: Add geometry logical type #240

Open · wants to merge 23 commits into master

Conversation

@wgtmac (Member) commented May 10, 2024

Apache Iceberg is adding geospatial support: https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI. It would be good if Apache Parquet could support a geometry type natively.

@jiayuasu (Member):

@wgtmac Thanks for the work. On the other hand, I'd like to highlight that GeoParquet (https://github.com/opengeospatial/geoparquet/tree/main) has been around for a while and a lot of geospatial software has started to support reading and writing it.

Is the ultimate goal of this PR to merge the GeoParquet spec into Parquet completely, or might it end up creating a new spec that is not compatible with GeoParquet?

@jiayuasu (Member):

Geo Iceberg does not need to conform to GeoParquet, because people should not directly use a Parquet reader to read Iceberg Parquet files anyway. So that's a separate story.

@wgtmac (Member, Author) commented May 11, 2024

Is the ultimate goal of this PR to merge the GeoParquet spec into Parquet completely, or might it end up creating a new spec that is not compatible with GeoParquet?

@jiayuasu That's why I've asked about the possibility of direct compliance with the GeoParquet spec in the Iceberg design doc. I don't intend to create a new spec. Instead, it would be good if the proposal here could meet the requirements of both Iceberg and GeoParquet, or share enough common ground that conversion between Iceberg Parquet and GeoParquet is lightweight. We do need advice from the GeoParquet community to make that possible.

@szehon-ho left a comment:

From the Iceberg side, I am excited about this. I think it will make geospatial interop easier in the long run to define the type formally in parquet-format, and it also unlocks row group filtering; for example, Iceberg's add_file for Parquet files. Perhaps there can be conversion utils for GeoParquet if we go ahead with this, and I'd definitely like to see what they think.

I'm new on the Parquet side, so I had some questions.

@wgtmac marked this pull request as ready for review May 11, 2024 16:13
@wgtmac changed the title from "WIP: Add geometry logical type" to "PARQUET-2471: Add geometry logical type" May 11, 2024
@pitrou (Member) commented May 15, 2024

@paleolimbot is quite knowledgeable on the topic and could probably give useful feedback.

@pitrou (Member) commented May 15, 2024

I wonder if purely informative metadata really needs to be represented as Thrift types. When we define canonical extension types in Arrow, metadata is generally serialized as a standalone JSON string.

Doing so in Parquet as well would lighten the maintenance workload on the serialization format, and would also allow easier evolution of geometry metadata to support additional information.

Edit: this seems to be the approach adopted by GeoParquet as well.

@paleolimbot (Member) left a comment:

I wonder if purely informative metadata really needs to be represented as Thrift types. When we define canonical extension types in Arrow, metadata is generally serialized as a standalone JSON string.

In reading this I do wonder if there should just be an extension mechanism here instead of attempting to enumerate all possible encodings in this repo. The people who are engaged and working on implementations are the right people to engage here, which is why GeoParquet and GeoArrow have been successful (we've engaged the people who care about this, and they are generally not paying attention to apache/parquet-format or apache/arrow).

There are a few things that this PR solves in a way that might not be possible using EXTENSION, chiefly column statistics. It would be nice to have some geo-specific things there (although maybe that can also be part of the extension mechanism). Another thing that comes up frequently is where to put a spatial index (e.g., an R-tree)... I don't think there's any good place for that at the moment.

It would be nice to allow the format to be extended in a way that does not depend on schema-level metadata: the things we do in the GeoParquet standard (store bounding boxes, refer to columns by name) become stale with the way schema metadata is typically propagated through projections and concatenations.

@wgtmac (Member, Author) commented May 17, 2024

I wonder if purely informative metadata really needs to be represented as Thrift types. When we define canonical extension types in Arrow, metadata is generally serialized as a standalone JSON string.

Doing so in Parquet as well would lighten the maintenance workload on the serialization format, and would also allow easier evolution of geometry metadata to support additional information.

Edit: this seems to be the approach adopted by GeoParquet as well.

@pitrou Yes, that might be an option. Then we could perhaps use the same JSON string defined in the Iceberg doc. @jiayuasu @szehon-ho WDYT?

EDIT: I think we can remove informative attributes like subtype, orientation, and edges. Perhaps encoding can be removed as well if we only support WKB. dimension is something we must be aware of, because we need to build the bbox, and that depends on whether coordinates are represented as xy, xyz, xym, or xyzm.
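
(For illustration: a minimal sketch of reading the dimension from ISO WKB, which encodes xyz/xym/xyzm by adding 1000/2000/3000 to the base geometry type code. EWKB, which uses flag bits instead, is not handled here.)

import struct

def wkb_dimension(wkb: bytes) -> str:
    # Byte 0 is the byte order (0 = big-endian, 1 = little-endian);
    # the geometry type code is the next 4 bytes.
    endian = "<" if wkb[0] == 1 else ">"
    (type_code,) = struct.unpack_from(endian + "I", wkb, 1)
    # ISO WKB puts the dimension in the thousands digit of the type code.
    return {0: "xy", 1: "xyz", 2: "xym", 3: "xyzm"}[type_code // 1000]

# POINT Z (1 2 3) as little-endian ISO WKB: type code 1001
wkb = struct.pack("<BIddd", 1, 1001, 1.0, 2.0, 3.0)
assert wkb_dimension(wkb) == "xyz"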

@wgtmac (Member, Author) commented May 17, 2024

Another thing that comes up frequently is where to put a spatial index (e.g., an R-tree)

I thought this could be something similar to the page index or bloom filter in Parquet, which are stored somewhere between row groups or before the footer. It could be at the row group level or the file level as well.

It would be nice to allow the format to be extended in a way that does not depend on schema-level metadata.

I think we really need your advice here. If you could rethink the design of GeoParquet, how could it do better if the Parquet format had some geospatial knowledge? @paleolimbot @jiayuasu

@paleolimbot (Member):

If you could rethink the design of GeoParquet, how could it do better if the Parquet format had some geospatial knowledge?

The main reason the schema-level metadata had to exist is that there was no way to put anything custom at the column level to give geometry-aware readers extra metadata about the column (CRS being the main one) and global column statistics (bbox). Bounding boxes at the feature level (worked around as a separate column) are the second somewhat ugly thing; they give reasonable row group statistics for many things people might want to store. It seems like this PR would solve most of that.

I am not sure that a new logical type will catch on to the extent that GeoParquet will, although I'm new to this community and I may be very wrong. The GeoParquet working group is enthusiastic and encodings/strategies for storing/querying geospatial datasets in a data lake context are evolving rapidly. Even though it is a tiny bit of a hack, using extra columns and schema-level metadata to encode these things is very flexible and lets implementations be built on top of a number of underlying readers/underlying versions of the Parquet format.

@wgtmac (Member, Author) commented May 18, 2024

@paleolimbot I'm happy to see the fast evolution of the GeoParquet spec. I don't think the addition of a geometry type aims to replace or deprecate anything from GeoParquet. Instead, GeoParquet can simply ignore the new type as of now, or leverage the built-in bbox if beneficial. For additional (informative) attributes of the geometry type, if some of them are stable and it makes sense to store them natively in Parquet column metadata, then perhaps we can work together to make it happen? I think the main goal of this addition is to enhance interoperability of geospatial data across systems, while keeping the effort to convert to GeoParquet small.

@Kontinuation (Member):

Another thing that comes up frequently is where to put a spatial index (e.g., an R-tree)

I thought this could be something similar to the page index or bloom filter in Parquet, which are stored somewhere between row groups or before the footer. It could be at the row group level or the file level as well.

The bounding-box-based sort order defined for the geometry logical type is already good enough for row-level and page-level data skipping. A spatial index such as an R-tree may not be suitable for Parquet. I am aware that FlatGeobuf has an optional static packed Hilbert R-tree index, but for that index to be effective, FlatGeobuf supports random access to records and does not support compression. The minimal granularity for reading data in Parquet files is the data page, and pages are usually compressed, so records within pages cannot be accessed randomly.

@paleolimbot (Member):

I'm happy to see the fast evolution of the GeoParquet spec. I don't think the addition of a geometry type aims to replace or deprecate anything from GeoParquet.

I agree! I think first-class geometry support is great and I'm happy to help wherever I can. I see GeoParquet as a way for existing spatial libraries to leverage Parquet; it is not well-suited to Parquet-native things like Iceberg (although others working on GeoParquet may have a different view).

Extension mechanisms are nice because they allow an external community to hash out the discipline-specific details, which evolve at an orthogonal rate to that of the format (e.g., GeoParquet), and that generally results in buy-in. I'm not familiar with the speed at which the changes proposed here can evolve (or how long it generally takes readers to implement them), but if @pitrou's suggestion of encoding the type information or statistics in serialized form makes it easier for this to evolve, it could provide some of that benefit.

A spatial index such as an R-tree may not be suitable for Parquet

I also agree here (but it did come up a lot of times in the discussions around GeoParquet). I think developers of Parquet-native workflows are well aware that there are better formats for random access.

@paleolimbot (Member):

I think we really need your advice here. If you could rethink the design of GeoParquet, how could it do better if the Parquet format had some geospatial knowledge?

I opened opengeospatial/geoparquet#222 to collect some thoughts on this... we discussed it at our community call, and I think we mostly just never considered that the Parquet standard would be interested in supporting a first-class data type. I've put my thoughts there but I'll let others add their own opinions.

@jorisvandenbossche (Member):

Just to ensure my understanding is correct:

  • This is proposing to add a new logical type annotating the BYTE_ARRAY physical type. For readers that expect just such a BYTE_ARRAY column (e.g. existing GeoParquet implementations), that is compatible even if the column starts having a logical type as well? (although I assume this might depend on how the specific Parquet reader implementation deals with an unknown logical type, i.e. error out or automatically fall back to the physical type)
  • For such "legacy" readers (just reading the WKB values from a binary column), the only thing that actually changes (apart from the logical type annotation) is the values of the statistics? Now, I assume that right now no GeoParquet reader is using the statistics of the binary column, because the physical statistics for BYTE_ARRAY ("unsigned byte-wise comparison") are essentially useless when those binary blobs represent WKB geometries. So again that should probably not give any compatibility issues?

@jorisvandenbossche (Member):

although I assume this might depend on how the specific Parquet reader implementation deals with an unknown logical type, i.e. error out or automatically fall back to the physical type

To answer this part myself: at least for the Parquet C++ implementation, it seems an error is raised for unknown logical types, and it doesn't fall back to the physical type. So that does complicate the compatibility story...

@wgtmac (Member, Author) commented May 21, 2024

@jorisvandenbossche I think your concern makes sense. It should be considered a bug if parquet-cpp fails due to an unknown logical type, and we need to fix that. I also have concerns about a new ColumnOrder and need to do some testing. Adding a new logical type should not break anything for legacy readers.

@jornfranke commented May 21, 2024

Apache Iceberg is adding geospatial support: https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI. It would be good if Apache Parquet could support a geometry type natively.

No one has really worked on the geo integration into Iceberg for some time: apache/iceberg#2586

@szehon-ho:

No one has really worked on the geo integration into Iceberg for some time: apache/iceberg#2586

Yes, there is now a concrete proposal (apache/iceberg#10260), and the plan currently is to bring it up in the next community sync.

@cholmes commented May 23, 2024

Thanks for doing this @wgtmac - it's awesome to see this proposal! I helped initiate GeoParquet, and hope we can fully support your effort.

@paleolimbot I'm happy to see the fast evolution of the GeoParquet spec. I don't think the addition of a geometry type aims to replace or deprecate anything from GeoParquet. Instead, GeoParquet can simply ignore the new type as of now, or leverage the built-in bbox if beneficial.

That makes sense, but I think we're also happy to have GeoParquet replaced! As long as it can 'scale up' to meet all the crazy things that hard-core geospatial people need, while also being accessible to everyone else. If Parquet had had geospatial types from the start, we wouldn't have started GeoParquet. We spent a lot of time and effort trying to strike the right balance between making it easy to implement for those who don't care about the complexity of geospatial (edges, coordinate reference systems, epochs, winding) and having the right options for those who do. My hope has been that the decisions we made there will make it easier to add geospatial support to any new format, like a 'geo-ORC' using the same fields and options that we added.

For additional (informative) attributes of the geometry type, if some of them are stable and it makes sense to store them natively in Parquet column metadata, then perhaps we can work together to make it happen? I think the main goal of this addition is to enhance interoperability of geospatial data across systems, while keeping the effort to convert to GeoParquet small.

Sounds great! Happy to have GeoParquet be a place to 'try out' things. But ideally the surface area of 'GeoParquet' would be very minimal or not even exist, and Parquet would just be the ideal format to store geospatial data in. And I think if we can align well between this proposal and GeoParquet, that should be possible.

@cholmes left a comment:

Looks great! Added a few minor comments and +1'ed most everything @paleolimbot added. But great work on this all around.

* between points represent a straight cartesian line or the shortest line on
* the sphere. It applies to all non-point geometry objects.
*/
enum Edges {

I'd say EdgeInterpolation is a better name than EdgeKind. +1 for keeping the attribute of the type the same as GeoParquet for consistency and simplicity, but it makes sense to me that the enum name could be more descriptive.

/**
* A type of covering. Currently accepted values: "WKB".
*/
1: required string kind;

Is 'kind' a typical thing in Parquet? If not, I'd call this 'type'. 'Kind' sounds weird to me, but I'm not sure whether 'type' has a very specific meaning for Parquet. This is totally just a stylistic thing, so feel free to ignore. But note in the description we say 'a type of covering', not 'the kind of covering'. 'Kind' seems fuzzier, that it's WKB or close to it, while 'type' seems to indicate just one to me.

@wgtmac (Member, Author):

I think there was a discussion about it but I cannot find it. I'm happy to switch kind to type if that looks better. @jiayuasu @paleolimbot WDYT?

Member:

I am fine with that but note we also need to change crs kind to be consistent.

Contributor:

Is this necessary? It is not in the Iceberg version, so I'm wondering whether it is required for the initial release.

Member:

Uh, no, please don't. Parquet already has types (physical and logical). There is a reason for the use of another term here.

@wgtmac (Member, Author):

Is this necessary? It is not in the Iceberg version, so I'm wondering whether it is required for the initial release.

Yes, it is not used by Iceberg. Covering was discussed in #240 (comment) to make it flexible enough to add more kinds of geospatial indexes without changing the Parquet spec in the future.

Parquet already has types (physical and logical). There is a reason for the use of another term here.

Sorry, I cannot find the original discussion. IIRC, Covering/type was proposed in the beginning and then switched to Covering/kind for similar reasons. So I'll keep using kind unless there is a better idea.

Member:

@rdblue For the WKB geometry encoding, we might introduce a vectorized encoding for geometry in the future and allow both WKB and vectorized encodings to co-exist. So we want to leave the door open.

Member:

@jiayuasu I think your last comment was meant for the thread at #240 (comment)?


So I'll keep using kind unless there is a better idea.

Cool, and sorry for the noise; I missed the original discussion that moved it to kind.


I am fine with that but note we also need to change crs kind to be consistent.

If this is referring to the encoding (WKT versus JSON, etc.), I really think it needs to be renamed "CRS encoding". CRS kind means something else (geographic versus projected versus engineering versus parametric, etc.).

wgtmac and others added 5 commits September 20, 2024 23:08 (co-authored by Dewey Dunnington)
/**
* Physical type and encoding for the geometry type.
*/
enum GeometryEncoding {
Contributor:

Why is this an enum rather than a requirement? This seems too generic to me.

Member:

I believe this is here to ensure we do not have to break compatibility if a newer or better encoding achieves the same level of ubiquity as WKB. The GeoParquet spec already has more than one encoding that we are experimenting with, so it is not unreasonable that there may be reasons to evolve in the future (even if there are no plans to do so right now).

Contributor:

I don't think anything should be here unless it has a clear and necessary use right now. We can always add new enum values later, and we can also add new optional fields to the logical type struct. I would cut this to keep the spec simple.

@wgtmac (Member, Author):

GeometryEncoding is not something we have to set for now, as we have only one option. Perhaps we can simply add a comment to the GEOMETRY type to require WKB. Later on, an optional enum GeometryEncoding can be added once a new encoding has been proposed officially.

Member:

I think this is OK as long as the next encoding doesn't use a ByteArray as storage (otherwise an older reader would have no way to error out when it encounters a new encoding it does not understand, if I'm understanding correctly). I am not sure that is likely, but such encodings do exist.

@wgtmac (Member, Author):

The next encoding can still use ByteArray. We can simply apply the following rules once WKB is not the only encoding (a reader-side sketch follows the list):

  • If GeometryEncoding is set, use the set encoding to interpret the binary data.
  • Otherwise, GeometryEncoding.WKB should be the only option to interpret the binary data.
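
(A sketch of that reader-side rule with illustrative names; any encoding the reader does not recognize must fail loudly rather than be misread as WKB:)

def interpret_geometry(value: bytes, geometry_encoding: str | None) -> bytes:
    # An absent GeometryEncoding implies WKB per the rule above.
    if geometry_encoding is None or geometry_encoding == "WKB":
        return value  # hand off to a WKB parser, e.g. shapely.wkb.loads
    # A reader that does not know the new encoding must stop here.
    raise ValueError(f"unsupported geometry encoding: {geometry_encoding!r}")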

Member:

Readers that don't check whether a GeometryEncoding is present will then still read wrong data in that scenario.

To me, it feels safer to simply keep this enum.

@wgtmac (Member, Author):

To work around the legacy reader problem, we would have to add a separate geometry type whenever a new encoding is introduced. From the discussion so far, it seems we'd better keep the enum GeometryEncoding.

* [1] https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary
* [2] https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L159
*/
3: optional list<i32> geometry_types;
Contributor:

What is this used for? Is it for some type of pushdown?

Member:

If you know in advance what all of the geometry types are (notably, if they are all the same, which is common), you can often choose a simpler or more performant code path, or provide a more informative error sooner. Declaring a geometry type at the metadata level is very common when describing geospatial datasets.

Contributor:

Okay, can you give an example? The critical information is whether this needs to be in the spec for a purpose. Specs that give lots of freedom without specific guidance and requirements can create long-lasting problems.

A performance improvement is a good reason to have this kind of thing, but I'd like to understand more to make sure the requirements here are clearly stated.

Member:

A concrete example of performance would be reading a Parquet geometry column containing only points. A generic geometry column is generally read into memory as something like a list of abstract "geometry" objects, each of which can be a geometry of any type (e.g., a vector of JTS geometries or a GeoPandas GeoSeries). This has a high memory requirement and is inefficient for a number of things you might want to do with a huge number of points, like building an index. A reader that knows in advance that there are only points can choose to decode the column into a vector of x values and y values.

For geospatial datasets this is almost always available to a caller inspecting a dataset... for geospatial practitioners, not knowing the geometry type is a little like not knowing whether an integer is signed or unsigned (i.e., a very basic piece of information we need to know).
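
(To make the specialized path concrete, a sketch that decodes a column of 2D little-endian ISO WKB points straight into x and y vectors; a production reader would also validate byte order and type codes:)

import struct

def decode_point_column(wkb_values):
    # Each value is a 21-byte WKB point: byte order (1) + type code (4) + x (8) + y (8).
    xs, ys = [], []
    for wkb in wkb_values:
        x, y = struct.unpack_from("<dd", wkb, 5)
        xs.append(x)
        ys.append(y)
    return xs, ys

points = [struct.pack("<BIdd", 1, 1, x, y) for x, y in [(0.0, 1.0), (2.5, -3.5)]]
assert decode_point_column(points) == ([0.0, 2.5], [1.0, -3.5])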

@wgtmac (Member, Author):

IMHO, the geometry_types attribute is also a trade-off to avoid adding a set of explicit subtypes like POINT, LINESTRING, POLYGON, etc. Another possible use case is that an application can quickly detect unexpected data (e.g. any non-polygon geometry) by checking geometry_types and decide whether a function like ST_CONTAINS can be applied safely.
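
(A sketch of that fast rejection, assuming the WKB integer codes referenced by geometry_types: 3 = Polygon, 6 = MultiPolygon, with +1000/+2000/+3000 for the Z/M/ZM variants:)

POLYGONAL = {3, 6}  # Polygon, MultiPolygon

def polygon_only(geometry_types):
    # Strip the ISO WKB Z/M/ZM offset, then require every declared type
    # to be polygonal before planning a polygon-only code path.
    return all(code % 1000 in POLYGONAL for code in geometry_types)

assert polygon_only([3, 1003])     # Polygon, Polygon Z
assert not polygon_only([3, 2])    # a LineString sneaks in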

* features like statistics or filter pushdown. Using a list of key-value pair
* provides maximum flexibility for adding future informative metadata.
*/
5: optional list<KeyValue> key_value_metadata;
Contributor:

What is the use case for this?

Member:

The GeoParquet specification contains some concepts not covered here, like orientation (describing whether polygons can be assumed to be correctly wound) and epoch (to better contextualize coordinates in something like WGS84, where continental movement might affect locations over time). The KeyValue list here lets the specification evolve without a change to the Thrift (which would necessitate a new version of an implementation in most cases).

Contributor:

Why is this necessary for the definition of the logical type? Can these undocumented properties be part of the file's key/value metadata instead? And if these properties are important to the type definition, why are they undocumented key/value properties rather than defined in the spec?

Member:

Earlier review pushed the definition of the geometry type towards something that would be able to fully accommodate the GeoParquet specification without further modification of the Thrift. We could include definitions for epoch and orientation (which have been in two released versions of the GeoParquet specification, which is on its way to becoming an official OGC specification), or we could omit this for now.

Member:

why are they undocumented key/value properties rather than defined in the spec?

They are documented, but in the GeoParquet spec (until the last commit that changed this from a string to a list<KeyValue> attribute, the comment linked to https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L46; IMO we should add back the guidance that the key_value_metadata entries are meant for that)

In general there is a tension between defining everything here (getting too deep into geospatial-specific details for Parquet) and just referring to the GeoParquet spec for details and additional (optional) metadata (making the Parquet spec less complete and self-describing).

Throughout the discussion, we have gone back and forth on this. Initially, everything was included here, but there was a desire to remove as much of the geo-specific metadata as possible. And then later only the most essential pieces of metadata were added back (encoding, crs, geometry_types, edges).

@wgtmac (Member, Author) commented Sep 24, 2024:

There is metadata which may be unimportant to Parquet but important to GeoParquet. As GeoParquet is still evolving fast, we do not want to modify the Thrift frequently to adopt every new proposal from GeoParquet. Metadata fields like crs and edges are required to interpret the data, so we have added them. We can of course add more explicit metadata in the future when it is required to interpret the data.

Member:

Just a note that edges probably has to be available to the Parquet implementation because it affects how (or whether) to push down a spatial filter. I would prefer to keep the CRS in the Thrift because otherwise something like the C++ implementation would have to parse JSON to pass the CRS on to the Arrow type (which is possible, but ugly).

@wgtmac (Member, Author):

Sorry for not making it clear. Just edited my previous response.

@wgtmac (Member, Author):

They are documented, but in the GeoParquet spec (until the last commit that changed this from a string to a list<KeyValue> attribute, the comment linked to https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L46; IMO we should add back the guidance that the key_value_metadata entries are meant for that)

Good suggestion! I've added back the comment.

@@ -1084,6 +1290,9 @@ struct ColumnIndex {
* Same as repetition_level_histograms except for definitions levels.
**/
7: optional list<i64> definition_level_histograms;

/** A list containing statistics of GEOMETRY logical type for each page */
8: optional list<GeometryStatistics> geometry_stats;
Contributor:

Why are there stats for each page? Each bbox is up to 64 bytes, which seems like a lot of overhead at the page level, especially given that WKB objects are also considerably larger than most values stored in a Parquet page.

@wgtmac (Member, Author):

All fields of GeometryStatistics are optional, and the geometry_stats field itself is optional. Isn't it better to give writer implementations the freedom to turn on the features they need?

Contributor:

I think that it is better to give guidance and keep the spec small. I think we should only add things that have a clear use right now.

Member:

I can't speak to data pages since I am not familiar with that level of the specification; however, these are absolutely essential at the column chunk level. I will say that even for very small objects, knowing the bounding box is typically worth it (e.g., nearly all spatial formats cache this information for every single geometry object). This is because many geometry operations, particularly with polygons, are incredibly expensive and can often be skipped for features that don't intersect.

@wgtmac (Member, Author):

Page-level stats are super useful in a needle-in-a-haystack search. Computation on the geometry type can be very slow due to its mathematical complexity. Page-level stats such as the bounding box can help filter out unnecessary pages, because computation on bounding boxes is orders of magnitude faster than on complex polygons.
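
(A sketch of that page-skipping test for planar edges; tuple layout is illustrative:)

def may_intersect(page_bbox, query_bbox):
    # Each bbox is (xmin, ymin, xmax, ymax). A page whose bbox is disjoint
    # from the query window cannot contain a matching geometry, so the
    # expensive per-geometry predicate never runs for it.
    axmin, aymin, axmax, aymax = page_bbox
    bxmin, bymin, bxmax, bymax = query_bbox
    return axmin <= bxmax and bxmin <= axmax and aymin <= bymax and bymin <= aymax

assert not may_intersect((0, 0, 1, 1), (2, 2, 3, 3))  # page skipped
assert may_intersect((0, 0, 1, 1), (0.5, 0.5, 3, 3))  # page must be read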

Contributor:

Just to unify the conversation: a page typically holds O(10000s of records) or O(multiple KBs to MBs of data). It sounds like in at least some cases it might be worth it. To @rdblue's point, we might want to evaluate the additional file size this data adds, to make sure estimates are accurate for some geography data; if it is not worth writing by default, we should consider removing it.

* Interpretation for edges of elements of a GEOMETRY logical type. In other
* words, whether a point between two vertices should be interpolated in
* its XY dimensions as if it were a Cartesian line connecting the two
* vertices (planar) or the shortest spherical arc between the longitude


Suggested change:
- * vertices (planar) or the shortest spherical arc between the longitude
+ * vertices (planar) or the shortest geodesic arc between the longitude

Member:

Whatever wording we choose here must make clear that the interpolation assumes a sphere (not an ellipsoid). Even though an ellipsoid is a better approximation of the surface, there are no geometry engines we know of that are interested in storing Parquet files that define edges in this way. Notably, the BigQuery and Snowflake geography types define edges in this way.

* words, whether a point between two vertices should be interpolated in
* its XY dimensions as if it were a Cartesian line connecting the two
* vertices (planar) or the shortest spherical arc between the longitude
* and latitude represented by the two vertices (spherical). This value


Suggested change:
- * and latitude represented by the two vertices (spherical). This value
+ * and latitude represented by the two vertices (spherical or spheroidal). This value

Member:

Again, we very specifically are approximating this interpolation using a sphere (perhaps there is clearer language to express that!)

* coordinate reference system.
*
* Because most systems currently assume planar edges and do not support
* spherical edges, planar should be used as the default value.


Suggested change:
- * spherical edges, planar should be used as the default value.
+ * spherical or spheroidal edges, planar should be used as the default value.

*/
enum EdgeInterpolation {
PLANAR = 0;
SPHERICAL = 1;


Suggested change:
- SPHERICAL = 1;
+ SPHERICAL = 1;
+ SPHEROIDAL = 2;


To be honest, I am not sure how this enum is useful. As @desruisseaux mentioned, this information can be deciphered from the CRS. Why do we have it?

Member:

There are several threads here that discuss this, but briefly: we need to ensure that the intent of mainstream producers that (1) have a geometry type and (2) export Parquet files can be expressed. BigQuery and Snowflake geography types approximate connecting two vertices using an arc assuming a sphere; all other geometry types connect vertices with a straight line regardless of the CRS.

@wgtmac (Member, Author):

My understanding is that PLANAR and SPHERICAL are already adopted by engines (e.g. BigQuery and Snowflake, mentioned above) and standards (e.g. GeoParquet) which support the Parquet file format. It is clear that we should support PLANAR and SPHERICAL at the moment. Since it may take some time to discuss SPHEROIDAL, perhaps we can add it as a follow-up work item?


I think the confusion may come from the names of the enumeration values. If I'm understanding right, the intent of PLANAR is "shortest path between two points on a plane". Problems:

  • It is not the only way to connect two points on a plane. We may also want Bézier curves, splines, etc.
  • PLANAR is a bit restrictive, as it seems to exclude 3D space.

What about renaming PLANAR to STRAIGHT_LINE? It leaves room for other interpolation methods on the plane in the future if desired. It also adds real value compared to just saying that the coordinate system is Cartesian, which can already be inferred from the CRS.

Likewise, SPHERICAL and SPHEROIDAL could be renamed GEODESIC_LINE. It leaves room for different interpolation methods in the future, e.g. with smooth changes of heading instead of the shortest path. This is equivalent, on the plane, to spline curves featuring smooth changes of derivatives instead of sudden changes of direction at every point.

I saw the discussion about SPHERICAL meaning the use of spherical formulas even if the CRS is ellipsoidal. But I'm not sure how it would work in practice. Which sphere radius should the computation use? The authalic sphere? Something else?

Contributor:

I'm not an expert here, but this language seems consistent with GeoParquet. Was this terminology already discussed there?

Comment on lines 288 to 289
* A custom binary-encoded polygon or multi-polygon to represent a covering of
* geometries. For example, it may be a bounding box or an envelope of geometries


I am not sure I fully understand this statement.

Let's say we have a table with a single row containing LINESTRING(5 10,15 10) in WGS84. What would the bounding box be for this? What would the polygon be?

If we think of the bounding box as a Cartesian product of longitude x latitude values, then the box should be (approximately) [5,15] x [10,10.03769]. Now if you represent it as a polygon, this would be something like (assuming CW orientation) POLYGON((5 10,15 10,15 10.03769,5 10.03769)). Is this to be understood as a Cartesian polygon or as a polygon in WGS84? If the latter, then it refers to something completely different. If the intent is to represent a polygon in WGS84, the only thing that comes to mind is POLYGON((5 10,15 10,5 10)), which I am not sure is a valid representation of a polygon. The other option would be POLYGON((5 10,15 10,x 10.03769,5 10)) where x is the longitude value for which the latitude on the single segment of the linestring is 10.03769.

I understand that the keyword here is "custom", which is what leaves a lot of room for whatever one would like to implement. Maybe the only thing to change here is the phrasing. My main point is that a bounding box and a polygon are not the same thing in geographic coordinate systems.
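
(For reference, a spherical-approximation sketch that reproduces the latitude bulge above by sampling the great-circle arc; the 10.03769 figure was presumably computed on the WGS84 ellipsoid, and the spherical value lands nearby at about 10.0374:)

import math

def to_unit_vector(lon, lat):
    lam, phi = math.radians(lon), math.radians(lat)
    return (math.cos(phi) * math.cos(lam), math.cos(phi) * math.sin(lam), math.sin(phi))

def max_latitude_on_arc(a, b, samples=10001):
    # Spherical linear interpolation (slerp) between the endpoints,
    # tracking the highest latitude reached along the arc.
    p, q = to_unit_vector(*a), to_unit_vector(*b)
    omega = math.acos(sum(x * y for x, y in zip(p, q)))
    best = -90.0
    for i in range(samples):
        t = i / (samples - 1)
        w1, w2 = math.sin((1 - t) * omega), math.sin(t * omega)
        v = [(w1 * x + w2 * y) / math.sin(omega) for x, y in zip(p, q)]
        best = max(best, math.degrees(math.asin(v[2])))
    return best

print(max_latitude_on_arc((5, 10), (15, 10)))  # ~10.0374 on a sphere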

Member:

Great point that there is room for improvement in the phrasing here!

I believe the intent is that this can be any polygon that completely covers the values it represents, such that st_intersects(arbitrary_geometry, covering) is guaranteed to be true whenever st_intersects(arbitrary_geometry, value) is true. One easy way to generate this is to take the bounding box (as defined here) and return its vertices as a polygon. Your example is a horizontal line (in Cartesian space, which it could be defined as if the EdgeInterpolation were set to PLANAR), and so this would be a degenerate polygon (but could still be defined). For spherical edges, one could compute a discrete global grid covering (e.g., S2 or H3) and convert the boundary of that to a polygon.

Comment on lines 303 to 304
* covers the contents. This will be interpreted according to the same CRS
* and edges defined by the logical type.


Okay, this answers one of my questions above: whether the polygon or multipolygon is in the same CRS as the input geometry. See my other comment. I do not think it is as simple as it looks.

* but it is recommended that the writer always generates bounding box statistics,
* regardless of whether the geometries are planar or spherical.
*/
struct BoundingBox {


I am mostly aligned with @desruisseaux on this one. The one catch here is that if you have a longitude start value of 60° and a longitude end value of -60°, then the "going to +infinity, then wrapping to -infinity" approach produces a longitude range that is more than 180°, and I believe this is no longer a geodesic arc on the sphere/spheroid, no matter what the latitude values are (poles excluded).

* Encoding used in the above crs field. It MUST be set if crs field is set.
* Currently the only allowed value is "PROJJSON".
*/
4: optional string crs_encoding;
Contributor:

In terms of defining the type, I don't think that this encoding is relevant. The type should reference a CRS, but it is not the type's job to pass around the CRS definition or to support multiple ways to encode it (or, in this case, to contemplate that there could be other ways to encode it).

I would prefer passing the CRS as an identifier string (which is what we mostly agreed on in Iceberg) and adding ways to pass the CRS definition either in file metadata or in other ways.

@paleolimbot (Member) commented Sep 24, 2024:

The ability to include a parameterized CRS is absolutely essential for the GEOMETRY type in Parquet to be useful: not all CRSes have been catalogued, and many can't be, because they're too specific (e.g., a CRS optimized for a small locality or a specific project, or the view of a satellite orbiting a planet) or too old (e.g., one of my projects with the Canadian government digitizing several decades of sea ice coverage, where the first four decades were in a CRS that had never been catalogued but could be expressed in PROJJSON).

The crs_encoding piece is there to make the crs string unambiguous. I happen to think this is an improvement over many existing systems that just provide a string and force the reader to guess the intent; however, it is not strictly necessary (e.g., we could just define the CRS as a string).

Iceberg has a different set of use cases than Parquet... Parquet is useful to geospatial practitioners operating at a smaller scale who need to deal with these issues and want to use Parquet to do so. It may be that an identifier-based format fits the Iceberg use case well.
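
(For illustration, an abbreviated sketch of what the crs/crs_encoding pair could look like for a catalogued CRS; a real PROJJSON document also carries the datum ensemble and axis definitions, elided here:)

import json

column_crs = json.dumps({
    "type": "GeographicCRS",
    "name": "WGS 84",
    "id": {"authority": "EPSG", "code": 4326},  # a custom CRS would carry full parameters instead
})
column_crs_encoding = "PROJJSON"  # tells the reader how to parse the crs string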


The one catch here is that if you have a longitude start value of 60° and a longitude end value of -60° then "going to +infinity, then wrapping to -infinity" approach produces a longitude range that is more than 180°

Right. In my discussion about allowing "min" > "max" in a bounding box, I forgot to specify that doing a wraparound at infinity works for intersection and union calculations, or generally for everything that involves the <, > and = operations. It does not work for arithmetic. But if the purpose of the bounding boxes is fast searches (indexing), unions and intersections are all we need, aren't they?
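
(A comparison-only sketch of that intersection test, where westmost > eastmost marks a range wrapping the antimeridian; names are illustrative:)

def lon_ranges_intersect(a_west, a_east, b_west, b_east):
    # A range with west > east wraps across the antimeridian (e.g. [170, -170])
    # and is split into two ordinary intervals before comparing.
    def split(w, e):
        return [(w, e)] if w <= e else [(w, 180.0), (-180.0, e)]
    return any(w1 <= e2 and w2 <= e1
               for w1, e1 in split(a_west, a_east)
               for w2, e2 in split(b_west, b_east))

assert lon_ranges_intersect(170, -170, 175, 179)    # both hug the antimeridian
assert not lon_ranges_intersect(170, -170, -60, 60)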

wgtmac and others added 2 commits September 27, 2024 09:48
@jiayuasu (Member) commented Oct 5, 2024

I've updated the PR according to the feedback from several folks.

  1. Renamed EdgeInterpolation back to Edges, since the explanation in the comments already makes it clear and we don't have to use the long name any more.
  2. Removed the Covering statistics since they are not really useful at the moment. Neither the C++ nor the Java POC implements them.
  3. Adopted the westmost and eastmost representation of the BBox when edges = spherical.
  4. Offloaded the CRS representation to Parquet file metadata fields so that multiple geometry columns can refer to the same CRS. This also makes sure that the Parquet spec does not rely on another spec for the CRS.
  5. Removed the optional list<KeyValue> key_value_metadata. This makes sure that the geometry column definition is clear.

With this PR in place, the Parquet Geometry PR is nearly identical to the Iceberg Geometry PR, with the following exceptions:

  1. The Iceberg Geometry PR uses lower_bounds and upper_bounds instead of BBox statistics. But it could easily adopt the westmost and eastmost representation by updating the spec explanation.
  2. The Iceberg Geometry PR offloads CRS to table properties while Parquet Geometry PR offloads it to file metadata.

If an additional geometry column field like orientation is needed, we can add it to both the Iceberg and Parquet Geometry PRs, but we should not allow a list for arbitrary fields.

@jiayuasu (Member) commented Oct 5, 2024

@paleolimbot @rdblue Would you please review the PR again and let us know whether this works for you?

@paleolimbot (Member) left a comment:

Renamed EdgeInterpolation back to Edges, since the explanation in the comments already makes it clear and we don't have to use the long name any more.

+1 (I care that spherical edges can be represented and that it's clear what that means, but the name doesn't matter much to me)

Removed the Covering statistics since they are not really useful at the moment. Neither the C++ nor the Java POC implements them.

+1 (I do think that Covering, or at least a better option than a bounding box, is a good idea, but I agree that it makes sense to defer this to a time when two implementations actually support it)

Adopted the westmost and eastmost representation of the BBox when edges = spherical.

+1, although I left a note because it is not clear to me whether the language you included means that the values are still the Cartesian-ish min/max of coordinates (easier for writers, not that useful for readers) or a bounding box taking into account the curvature of any edges (more computationally expensive for writers, easy for readers).

Offloaded the CRS representation to Parquet file metadata fields so that multiple geometry columns can refer to the same CRS. This also makes sure that the Parquet spec does not rely on another spec for the CRS.

This seems like a very strange way to parameterize the CRS to me, and it doesn't simplify the specification (e.g., we still have to talk about all the same issues, but they are confusingly shoved to the side). I don't think I have anything to add here that hasn't already been hashed out on the relevant thread... the language before your last change allowed omitting the CRS if Iceberg (or somebody else) doesn't want to deal with it, and we can attempt a follow-up change to allow other encodings, perhaps with specific examples/evidence of how it is not possible/suboptimal/difficult to follow the spec.

Removed the optional list<KeyValue> key_value_metadata. This makes sure that the geometry column definition is clear.

+1 from me here (if further values need to be included we can make a PR to discuss whether they need to or should be added).

Comment on lines +291 to +292
* [westmost, eastmost, southmost, northmost], with necessary min/max values for
* Z and M if needed.
Member:

Suggested change:
- * [westmost, eastmost, southmost, northmost], with necessary min/max values for
- * Z and M if needed.
+ * [westmost, eastmost, southmost, northmost], with necessary min/max values for
+ * Z and M if needed. The bounding box is always interpreted as if it had planar edges,
+ * even when edges is non-planar.

Alternatively, you could state that it's the min/max pair of coordinate values from each axis even when Edges is spherical, if that is what you meant here (reading "northmost" would lead me to assume that the bounding box takes into account the curvature of any edges). This involves some extra effort for writers but makes it possible for readers to do filtering with reasonably simple math even for spherical edges.


On the "Edges" name

Renamed EdgeInterpolation back to Edges, since the explanation in the comments already makes it clear and we don't have to use the long name any more.

+1 (I care that spherical edges can be represented and that it's clear what that means, but the name doesn't matter much to me)

I think that the "edges" name is quite problematic. Such simple name assumes that no other information about edges will never be added. What if a future version wants to add information about edge interpolation accuracy, or the time period of each edge (as in moving features), etc?

I think we should reserve "edges" for the future. If not for edge data, it may be for something similar to CSS, where border is a shortcut for other properties such as border-style, border-width, etc.

Another reason is that developers rarely read documentation until they have no choice. For someone who sees the data without having read the documentation, "edges: planar" is highly confusing. Even people who have read the documentation may not remember all the details.

EdgeInterpolation, InterpolationMethod or simply Interpolation would also be more consistent with other OGC standards such as OGC 16-140 (Moving Features), which already uses "interpolation". See in particular "Predefined Interpolation Methods" under section 6.3.1.

On the "planar" and "spherical" names

As said before, I think that planar and spherical don't mean anything in the context of interpolation. For the "planar" case, see instead the above-cited Moving Features specification: it already provides discrete, stepwise, linear and spline interpolation methods. The first two may not be applicable to 2D space, but linear and spline are.

Member:

What would your ideal language be to describe WKB content in a column annotated with this GEOMETRY type that originated from a BigQuery or Snowflake Geography?

A note that the term "interpolation" (and some of the other ideas in this definition) comes from here:

https://github.com/google/s2geometry/blob/ca1f3416f1b9907b6806f1be228842c9518d96df/src/s2/s2projections.h#L57-L72

Member:

(And GEOS (and PostGIS/shapely/sf/... based on it) also uses the "interpolate" term for constructing a point on a line segment.)

@gszadovszky (Contributor):

I don't have experience with geometry data so my review is more general.

The actual specification of the logical types is currently placed in LogicalTypes.md. It is always good to have a proper description for the thrift objects, but I think the broader specification should be placed separately.
(If the geometry-related documentation grows big enough, we should even create a separate doc for it, but for now LogicalTypes.md should be fine.)

4: required double ymax;
5: optional double zmin;
6: optional double zmax;
7: optional double mmin;
Contributor:

nit: could we add docs for m? As someone not too immersed in geography, I'm not exactly sure what is meant by it.

*/
struct BoundingBox {
/** Westmost value if edges = spherical **/
1: required double xmin;
Contributor:

Small nit (maybe not necessary): consider using x_min, x_max, etc. I'd need to review the file for any prior art on naming for consistency. I guess the values here are consistent with the GeoParquet spec?

2: required Edges edges;
/**
* CRS (coordinate reference system) is a mapping of how coordinates refer to
* precise locations on earth. A crs is specified by a string, which is a Parquet
Contributor:

Suggested change:
- * precise locations on earth. A crs is specified by a string, which is a Parquet
+ * precise locations on earth. A CRS is specified by a string, which is a Parquet

/**
* CRS (coordinate reference system) is a mapping of how coordinates refer to
* precise locations on earth. A crs is specified by a string, which is a Parquet
* file metadata field whose value is the crs representation. An additional field
Contributor:

Suggested change:
- * file metadata field whose value is the crs representation. An additional field
+ * file metadata field whose value is the CRS representation. An additional field

@emkornfield (Contributor) left a comment:

I'm not an expert here. I left a few comments. It would be nice to make sure, for the open threads where there was some disagreement, that we have relative consensus, and hopefully we can push this forward.

@desruisseaux:

Created two pull requests:
