Geospatial Support #10260

Open
1 of 6 tasks
szehon-ho opened this issue May 2, 2024 · 21 comments
Labels
proposal Iceberg Improvement Proposal (spec/major changes/etc)

Comments

@szehon-ho
Collaborator

szehon-ho commented May 2, 2024

Proposed Change

(This is an abridged version of the proposal document)

Big data open source projects have been leveraged for storage and analysis of geospatial data for a long time, and a flourishing ecosystem has evolved. Examples are GeoParquet for Parquet, Sedona for Spark, GeoMesa for HBase and Cassandra, and in-development or completed native support in Hive and Trino. Given the central position of the Apache Iceberg table format in the stack, it would be great for Iceberg to support geospatial data natively as well.

There have been implementations of geospatial support in Iceberg (Geolake and Havasu) which have shown promising results. Unfortunately, as Iceberg lacks extension points, these have taken the form of forks of the project. It would be great to leverage the efforts and findings of these projects in adding native support to Iceberg.

This will add the following to the Iceberg project:

  • Geospatial types (e.g., point, linestring, polygon)
  • Geospatial expressions (st_covers, st_covered_by, st_intersects)
  • Geospatial partition transforms (XZ2)
  • Geospatial sort (hilbert)
  • Spark integration support

This will allow the following use cases:

  • Create a table with geospatial type
    CREATE TABLE geom_table (geom GEOMETRY);
  • Insert geospatial data
    INSERT INTO geom_table VALUES ('POINT(1 2)'), ('LINESTRING(1 2, 3 4)')
  • Query using geospatial predicates:
    SELECT * FROM geom_table WHERE ST_COVERS(geom, ST_POINT(0.5, 0.5))
  • Define a geospatial partition transform to allow partition filtering for geospatial query
    ALTER TABLE geom_table ADD PARTITION FIELD (xz2(geom))
  • Rewrite using geospatial sort order to allow file and row-group filtering for geospatial query
    CALL rewrite_data_files(table => 'geom_table', sort_order => 'hilbert(geom)')
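
To illustrate why a Hilbert sort can help file and row-group filtering, here is a minimal sketch of the textbook Hilbert-curve index on a square grid. It is not taken from the proposal: the function name `hilbert_index` is hypothetical, and it assumes each geometry has already been reduced to an integer grid cell, which is a step the proposal would have to define.

```python
def hilbert_index(n: int, x: int, y: int) -> int:
    """Map cell (x, y) of an n-by-n grid (n a power of two) to its
    position along the Hilbert curve. Hypothetical helper for illustration."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the next level sees a canonical orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


# Sorting cells by their Hilbert index keeps spatial neighbours close together.
cells = [(3, 0), (0, 1), (1, 1), (0, 0), (1, 0)]
cells.sort(key=lambda c: hilbert_index(4, *c))
```

Rows written in this order tend to give each file or row group a compact bounding box, which is what makes min/max style pruning effective for spatial predicates.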

Proposal document

https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI

Specifications

  • Table
  • View
  • REST
  • Puffin
  • Encryption
  • Other
@szehon-ho szehon-ho added the proposal Iceberg Improvement Proposal (spec/major changes/etc) label May 2, 2024
@szehon-ho
Collaborator Author

szehon-ho commented May 2, 2024

Note: special thanks to @jiayuasu and @Kontinuation from Wherobots for invaluable domain specific advice and POC support from Havasu Iceberg-fork and Geolake, and also @badbye and other members of Geolake for support.

Also thanks @aokolnychyi and @hsiang-c for reviewing locally.

@jiayuasu
Member

jiayuasu commented May 2, 2024

Looking forward to the feedback from Iceberg community!

@dmeaux

dmeaux commented Jul 23, 2024

Hi,

I work at Geomatys. We are interested in contributing to this effort, including bringing our 20+ years of experience and expertise from developing Apache SIS and from working on OGC's WKT-CRS and GeoAPI standards amongst many others in not only the vector domain but the raster, sensor, GeoDataCube, Discrete Global Grid Systems, and spatial indexing domains as well.

@szehon-ho
Collaborator Author

szehon-ho commented Jul 23, 2024

Hi @dmeaux, thanks for the note! Looking forward to any comments you have on the proposal and to working together on this.

Currently I believe this is waiting to see whether the necessary support in apache/parquet-format#240 can make progress, which @jiayuasu and his team at Wherobots are helping with in the Parquet community (huge thanks to @wgtmac for driving the effort). This was discussed briefly in the last sync (cc @rdblue); I should have updated it here.

@dmeaux

dmeaux commented Jul 23, 2024

@szehon-ho, you should see comments from @desruisseaux, our CRS and metadata expert (among many other things). For your background, he sits on several OGC standards committees and is one of the primary maintainers of Apache SIS.

@desruisseaux

To summarize:

Next to the "Coordinate Reference System, ie mapping of how coordinates refer to precise locations on earth" sentence, it may be worth saying that OGC:CRS84 has an accuracy of 2 meters, to avoid the impression that "precise location" means "unlimited" accuracy.

I would not recommend PROJJSON, unless there is a requirement to use JSON in text fields. PROJJSON is not a standard; it is specific to one particular project. A standard CRS JSON encoding is planned, but may not be ready for another 2 years. In the meantime, the most widely supported CRS encoding is ISO 19162 (a.k.a. "WKT 2"). The latter is supported by PROJ, Apache SIS and ESRI software, to name only the ones that I know.

JTS is a popular library, but I would nevertheless recommend keeping a degree of freedom to allow different libraries to coexist, because:

  • JTS is designed for Cartesian coordinate systems, while "GeoIceberg" phase 1 uses an ellipsoidal coordinate system (OGC:CRS84). Ignoring that fact and using JTS with latitudes and longitudes anyway is a common practice, but risky. Calculations (including interpolations) may look plausible on screen, but with a higher risk of being at odds with reality compared to using the right library with the right coordinate system. Even ST_Intersects can be wrong near the edges.
  • The need to have 2 coexisting libraries is real. For example, JTS for projected CRS and Google S2 for geographic CRS.
  • Alternative libraries such as the ESRI geometry API should not be neglected given ESRI's importance in governments.
  • Every piece of software has bugs. We also use JTS for most of our work on our side, but nevertheless sometimes need to switch to another library for some calculations where JTS has issues. Hence, the possibility to use Apache SIS with either JTS, ESRI or Java2D (still useful for Bezier curves), at the user's choice.
  • OGC Moving Features is becoming more and more important. For example, MobilityDB is a PostGIS extension for moving features. In this context, geometries like "line string" become "trajectory" with a new set of operations. JTS can store z and m coordinates, but the handling of time in a moving feature is more than just storing the coordinate values. Maybe JTS will support Moving Feature operations in a future version, or maybe not.
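
The first point in the list above can be made concrete with a small sketch. It compares a planar distance computed directly on (longitude, latitude) degrees with a great-circle distance on a sphere. The sphere is itself a simplification of the ellipsoid behind OGC:CRS84, and the function names and radius constant are assumptions chosen for illustration, not part of any library mentioned in this thread.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius; spherical approximation only


def planar_distance_deg(lon1, lat1, lon2, lat2):
    """Euclidean distance treating degrees as Cartesian units."""
    return math.hypot(lon2 - lon1, lat2 - lat1)


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters on a sphere (haversine formula)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# One degree of longitude at the equator and at 60 degrees north.
planar_equator = planar_distance_deg(0, 0, 1, 0)
planar_north = planar_distance_deg(0, 60, 1, 60)
sphere_equator = haversine_m(0, 0, 1, 0)
sphere_north = haversine_m(0, 60, 1, 60)
```

The planar computation reports the same length for both segments, while on the sphere the segment at 60 degrees north is roughly half as long as the one at the equator.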

@jiayuasu
Member

jiayuasu commented Jul 26, 2024

@desruisseaux Thanks for the great suggestion.

The status of this proposal is that it is waiting until the Parquet format accepts the geometry type (mostly by absorbing GeoParquet into the Parquet Geometry type). More detail can be found here: apache/parquet-format#240

The Iceberg community also has concerns related to PROJJSON, mostly because PROJ is the only library that can handle it and there is no Java alternative. However, considering this is an extremely controversial topic and the community could debate it forever, @szehon-ho and I want to make the CRS a string field (same as in the Parquet Geometry type). One can put WKT2 CRS, SRID, PROJJSON, CRSJSON in this string value. It is the reader / writer's responsibility to figure out what the string is. Does this make sense? In GeoIceberg Phase1, we will hard-code it to a value OGC:CRS84.

Regarding JTS and Google S2, what you said makes sense. We can implement a GeoLib-agnostic interface to accommodate this. Can you take a look at the Parquet Geometry proposal and comment on that?

@desruisseaux

A CRS as a string field is fine. I suggest limiting the allowed formats to the following:

  • WKT 2 as defined by ISO 19162
  • SRID in the following forms:
    • HTTP URL such as http://www.opengis.net/def/crs/epsg/0/4326
    • URN such as urn:ogc:def:crs:EPSG::4326
    • Maybe AUTHORITY:CODE as a shortcut for above URN with implicit urn:ogc:def:crs: prefix and no version number.
  • If Iceberg wants to handle 3D or 4D data, maybe allow compound SRID. Example for (latitude, longitude) with height above Mean Sea Level:
    • urn:ogc:def:crs,crs:EPSG::4326,crs:EPSG::5714
    • http://www.opengis.net/def/crs-compound?1=http://www.opengis.net/def/crs/epsg/0/4326&2=http://www.opengis.net/def/crs/epsg/0/5714
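
As a rough sketch of the AUTHORITY:CODE shortcut described in the list above, the helper below expands a short code into the URN form by adding the implicit `urn:ogc:def:crs:` prefix and leaving the version empty. The function name `expand_srid` is hypothetical, and passing URNs and HTTP URLs through unchanged is an assumption made only to keep the example small.

```python
URN_PREFIX = "urn:ogc:def:crs:"


def expand_srid(identifier: str) -> str:
    """Expand 'AUTHORITY:CODE' to a URN; leave URNs and HTTP URLs untouched."""
    text = identifier.strip()
    if text.lower().startswith(("urn:", "http://", "https://")):
        return text
    authority, sep, code = text.partition(":")
    if not sep or not authority or not code or ":" in code:
        raise ValueError(f"not an AUTHORITY:CODE identifier: {identifier!r}")
    # Two colons: the version slot between authority and code is left empty.
    return f"{URN_PREFIX}{authority}::{code}"


short_form = expand_srid("EPSG:4326")
```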

I suggest avoiding PROJJSON for now. It is not a standard, and if a slightly different CRS JSON standard is added later, allowing the 2 formats in the same field may create ambiguities. This is a risk that the OGC CRS working group will try to avoid, but it is still safer not to add PROJJSON too soon. Some issues with PROJJSON are:

  • Its model is based on a mix of WKT 2 and ISO 19111, which means that PROJJSON inherits some of the compromises made in WKT 2 for historical reasons. As a new format, it would have been nice to derive it more directly from the ISO 19111 model, unless there are good reasons to depart from it.
  • A CRS definition contains at least 3 elements inherited from the ISO 19115 metadata standard: Citation, Extent and PositionalAccuracy. The encoding of those elements in JSON is being defined by ISO and will be the ISO 19115-4 international standard. A CRS JSON standard should not invent a different encoding. If we (implementers) have to write a parser for ISO 19115-4, we want to use the same parser for at least the metadata elements inside a CRS JSON.

@jiayuasu
Member

jiayuasu commented Jul 28, 2024

@desruisseaux Great. Can you also explain which OSS libraries are available to parse these CRS formats? Ideally, we are looking for options in C, Java, and Python.

In addition, how can one tell that the string is in a certain CRS format (maybe by reading the first few characters, or by try/catch exception handling)?

@desruisseaux

Citing only the libraries that I know (more may be available):

  • C/C++
    • PROJ can parse all the above.
    • ESRI prototype can parse WKT 2. It was a prototype created by ESRI during the development of the WKT 2 standard, but it is under the Apache 2 license and can probably be reused if desired.
  • In Java:
    • Apache SIS can parse all the above, with the caveat that SIS 1.4 parses a slightly older version of WKT 2 (ISO 19162:2015 instead of ISO 19162:2019). The upgrade to the latest ISO 19162 standard is in progress right now and will be part of the SIS 1.5 release in one or two months.
    • GeoTools can parse WKT 1 and can handle AUTHORITY:CODE syntax. I did not mention WKT 1 in the list of allowed formats because it is potentially ambiguous. But for CRS where the units of measurement are restricted to degrees and meters and the map projection is restricted to a short list defined by OGC 01-009, it may be acceptable. The specification should explicitly disallow WKT 1 with other units of measurement to avoid ambiguity.
    • PROJ4J may have some WKT 1 parsing capability and AUTHORITY:CODE support as well (I did not verify closely).
  • In Python:
    • I think that a large part of the community uses a binding to PROJ.
    • Otherwise, PyCRS is pure Python and has WKT 1 support.

Caution about axis order when using authority code

When using SRID, axis order shall be as defined by the authority. It means that EPSG:4326 shall be (latitude, longitude), not (longitude, latitude). I know that a lot of developers hate that, but this rule should be strictly enforced if we do not want to cause again the confusion that existed for years before OGC decided to clarify this policy. It does not mean that we cannot use (longitude, latitude). It only means that if it is (longitude, latitude), don't call it EPSG:4326. Use another name, for example OGC:CRS84, or use WKT where axis order can be specified.
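
A minimal sketch of the rule above, assuming a hypothetical lookup table `AXIS_ORDER` and helper `to_lon_lat` that are not part of any library: the same pair of numbers means different things depending on which CRS name is attached to it.

```python
# Axis order as stated in this thread: EPSG:4326 is (latitude, longitude),
# OGC:CRS84 is (longitude, latitude). The table itself is illustrative.
AXIS_ORDER = {
    "EPSG:4326": ("latitude", "longitude"),
    "OGC:CRS84": ("longitude", "latitude"),
}


def to_lon_lat(crs: str, first: float, second: float) -> tuple:
    """Return (longitude, latitude) from a coordinate pair declared in `crs`."""
    if AXIS_ORDER[crs] == ("longitude", "latitude"):
        return (first, second)
    return (second, first)


# The same location, written in each CRS's own axis order.
from_epsg = to_lon_lat("EPSG:4326", 48.8566, 2.3522)
from_crs84 = to_lon_lat("OGC:CRS84", 2.3522, 48.8566)
```

Both calls yield the same (longitude, latitude) pair only because each input respected the axis order of the CRS it was labeled with.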

@desruisseaux

In addition, how can one tell that the string is in a certain CRS format (maybe by reading the first few characters, or by try/catch exception handling)?

For WKT 1 (if allowed) versus WKT 2, the library should be able to distinguish by itself, because the keywords are not the same. For WKT versus SRID, we can skip the first letters until we reach the first punctuation character. If it is :, this is probably an HTTP URL, a URN or an authority code. If it is [ or (, this is probably WKT 1 or 2. Note that WKT allows both [ and (, even if in practice I saw only the former.
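
The heuristic described above could be sketched as follows. The function name `guess_crs_format` and its return labels are hypothetical; skipping digits, underscores and whitespace along with letters is an added assumption so that keywords such as `COMPD_CS` are handled.

```python
def guess_crs_format(text: str) -> str:
    """Classify a CRS string by its first punctuation character."""
    for ch in text.strip():
        if ch.isalnum() or ch == "_" or ch.isspace():
            continue  # still inside the leading keyword or scheme name
        if ch == ":":
            return "identifier"  # HTTP URL, URN or AUTHORITY:CODE
        if ch in "[(":
            return "wkt"  # WKT 1 or WKT 2; the parser tells them apart
        return "unknown"
    return "unknown"


kinds = [
    guess_crs_format("EPSG:4326"),
    guess_crs_format("urn:ogc:def:crs:EPSG::4326"),
    guess_crs_format('GEOGCRS["WGS 84"]'),
]
```

A string that starts with `{`, such as a JSON document, falls through to `"unknown"` here, since the heuristic in the comment only covers WKT and SRID forms.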

@cholmes

cholmes commented Jul 30, 2024

One can put WKT2 CRS, SRID, PROJJSON, CRSJSON in this string value. It is the reader / writer's responsibility to figure out what the string is. Does this make sense?

I do think it'll help to allow as few options as possible, and ideally just one. The geospatial world too often imposes all of our intricacies and confusions on the rest of the world, which usually leads to it not getting adopted in the 'right' way, and people starting over from scratch. So I think it's very much worth figuring out the right 'path' for an implementor to go from 0 geospatial knowledge to fully implementing it. Pushing 5+ different options that all have slightly different trade-offs onto the responsibility of the authors of readers/writers isn't going to lead to a great outcome.

I do think for iceberg the right option is SRID, but shipping with a spatial_ref_sys table of SRID to wkt, like PostGIS and GeoPackage do, so it complies with simple features for sql specification (section 6.1.3). For GeoParquet we couldn't do that as we don't have the concept of tables, so couldn't ship with a spatial_ref_sys definition and PROJJSON emerged as the best option.

In GeoIceberg Phase1, we will hard-code it to a value OGC:CRS84.

That sounds like a good first approach - get everything working well with that, and then figure out projections later. But I do think we should work as a spatial community to get to one clear answer, instead of allowing writers to put in any value they want. At the very least there should be one strongly recommended option.

@kylebarron

Pushing 5+ different options that all have slightly different trade-offs onto the responsibility of the authors of readers/writers isn't going to lead to a great outcome

💯

We had a lot of discussion in GeoParquet around CRS and PROJJSON emerged as the favorite. See opengeospatial/geoparquet#90, and in particular opengeospatial/geoparquet#90 (comment), and opengeospatial/geoparquet#96. For any system that doesn't have access to PROJ and wants to understand something about the input CRS, parsing a WKT string is particularly terrible, while every language can parse JSON.

@desruisseaux

I agree that reducing the number of options is desirable. However, the argument in favour of PROJJSON is biased. It assumes that there are only two options: having PROJ, or having no referencing library at all. The third option, having a referencing library other than PROJ (e.g. ESRI, GeoTools, Apache SIS, Proj4J, PyCRS and more that I don't know), seems completely ignored. Those libraries support WKT, not PROJJSON. A standard CRS JSON is very likely to happen, just not now. It may be a matter of about 2 years. This delay is the price to pay for better consistency with ISO 19111 and ISO 19115-4.

In the meantime, if the community decides to exclude WKT, I would be in favour of only SRID with one amendment: if there is a desire to use EPSG codes with (east, north) axis order, consider making it explicit with a Permutation field as defined by ISO 19107:2019 §6.2.8.6 (note: it may be an issue for GeoParquet rather than Iceberg).

@kylebarron

Those libraries support WKT, not PROJJSON

The conversion between PROJJSON and WKT 2 is (relatively) simple https://github.com/rouault/projjson_to_wkt

@desruisseaux

The conversion between PROJJSON and WKT 2 is (relatively) simple

The argument works both ways: we could store WKT in Iceberg, and let applications convert to PROJJSON if desired. That would conform better to the usual practice of distributing data in a standard format and letting everyone convert to their own "proprietary" format if desired (in this case, "proprietary" means specific to a single project, even if open source).

@kylebarron

The argument works in both ways

Yes, except that every language can easily parse JSON; parsing WKT (even to just check for specific fields) is a tall order to do correctly without an external library.
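
To make the "check for specific fields" case concrete, here is a small sketch that reads an identifier out of a JSON CRS description using only the standard library. The fragment is hand-written and heavily abbreviated in the general shape of PROJJSON (a real document would also carry datum and coordinate system members), and the helper name `crs_identifier` is hypothetical.

```python
import json

# Abbreviated, illustrative fragment; not a complete or validated PROJJSON document.
projjson_fragment = """
{
  "type": "GeographicCRS",
  "name": "WGS 84",
  "id": {"authority": "EPSG", "code": 4326}
}
"""


def crs_identifier(document: str):
    """Return 'AUTHORITY:CODE' from the top-level "id" member, or None."""
    crs = json.loads(document)
    ident = crs.get("id")
    if not ident:
        return None
    return f'{ident["authority"]}:{ident["code"]}'


identifier = crs_identifier(projjson_fragment)
```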

@desruisseaux

Yes, except that every language can easily parse JSON

Well, in Java we need an external library for parsing JSON. But we are going in circles: JSON is easier to parse for non-geospatial libraries, but WKT is better supported by all geospatial libraries other than PROJ. It is not obvious to say which side is more important.

@dmeaux

dmeaux commented Jul 30, 2024

I agree with Martin. Going with PROJJSON is trading the long-term stability of WKT2 and OGC standards for expediency. I would prefer that the geometry be stored in a format that has been through the OGC/ISO standards process(es), because that leads to long-term stability and more fluid transitions as the technology evolves. CRSes, being at the core of everything we do, are too important to go with something that doesn't meet those standards. It will come back to bite us in the long run. Over the long term, going with PROJJSON will end up causing more pain than sticking with WKT2 for the near term and adding more standards as they are developed.

@cholmes

cholmes commented Jul 31, 2024

I agree that reducing the number of options is desirable. However, the argument in favour of PROJJSON is biased. It assumes that there are only two options: having PROJ, or having no referencing library at all.

PROJJSON is an open standard, not a reference library. It follows the rich tradition of geojson, georss, vector tiles, STAC, mbtiles, pmtiles, flatgeobuf, zarr, copc and many others in that it has started in the open source community and in real usage, and most have evolved to some form of formal standardization. Yes, PROJJSON right now only has a single implementation, but it is written as a JSON encoding of WKT2:2019, and the goal is to become a standard.

The third option, having a referencing library other than PROJ (e.g. ESRI, GeoTools, Apache SIS, Proj4J, PyCRS and more that I don't know) seems completely ignored.

No, that's not completely ignored - those just don't yet implement projjson. To me the next step is to push for them to implement it, and to try to find funding to enable that. The twist seems to be that many don't fully implement WKT2:2019. If they have a wkt2 implementation the parsing from JSON to wkt seems to be fairly easy - it took a day or two to do it for javascript. If OGC insists on a CRSJSON that differs too much from PROJJSON then libraries should be able to parse both and put them into the same WKT2:2019 data model.

A standard CRS JSON is very likely to happen, just not now. It may be a matter of about 2 years. This delay is the price to pay for better consistency with ISO 19111 and ISO 19115-4.

PROJJSON is not 1.0, and can easily evolve to be completely consistent with how the CRS spec evolves. But we need something that works today, not two years from now. Like I said above, my hope is that PROJJSON can evolve to be consistent with CRSJSON, or even merge with it. But if they don't manage to align 100%, then libraries should be able to easily parse both.

But we are going in circles: JSON is easier to parse for non-geospatial libraries, but WKT is better supported by all geospatial libraries other than PROJ. It is not obvious to say which side is more important.

If we want geospatial to have a bigger impact on the world than the size of the existing geospatial market it is clear to me that being easier to parse for non-geospatial libraries is more important. We can't expect every implementation of iceberg to include geospatial libraries, so we need a smooth 'on-ramp' for implementors to support geospatial without understanding the depths of coordinate reference systems. We have a great start, with just focusing on OGC:CRS84. Having a next step be to just understand a few common CRS's by parsing JSON seems like a good way to meet people 'more than half way'. And then geospatial libraries can evolve to support JSON encoding of CRS's (PROJJSON and/or CRS JSON) - and ideally we in the geospatial community work out that set of recommendations.

For now I think that bit is more important for GeoParquet, where the clear 'native' format to use for Parquet metadata is JSON. And I think we should all work together to get to a path from where we are today to the two year goal - we are loath to do a 2.0 for GeoParquet, but we could consider it if there is clear consensus between the various geospatial communities on the need for a breaking change from PROJJSON.

For Iceberg I do think the best answer is the SPATIAL_REF_SYS table, text from the core spec

6.1.3 Identification of Spatial Reference Systems

Every Geometry Column and every geometric entity is associated with exactly one Spatial Reference System.
The Spatial Reference System identifies the coordinate system for all geometric objects stored in the column, and
gives meaning to the numeric coordinate values for any geometric object stored in the column. Examples of
commonly used Spatial Reference Systems include "Latitude Longitude" and "UTM Zone 10".

The SPATIAL_REF_SYS table stores information on each Spatial Reference System in the database. The
columns of this table are the Spatial Reference System Identifier (SRID), the Spatial Reference System Authority
Name (AUTH_NAME), the Authority Specific Spatial Reference System Identifier (AUTH_SRID) and the Well-known Text description of the Spatial Reference System (SRTEXT). The Spatial Reference System Identifier
(SRID) constitutes a unique integer key for a Spatial Reference System within a database.

Interoperability between clients is achieved via the SRTEXT column which stores the Well-known Text
representation for a Spatial Reference System.

And there are additional details in postgis docs and geopackage spec.

This allows SRID to be used, but includes a table of all the core WKT values to map to those SRID's, and lets users define their own.
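
The table described in the quoted section can be sketched with SQLite from the Python standard library. This only illustrates the four columns and the SRID lookup; it says nothing about how Iceberg would store such a table. The SRTEXT value is a labeled placeholder rather than a real WKT definition, and the helper name `srtext_for` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE spatial_ref_sys (
        srid      INTEGER PRIMARY KEY,  -- unique integer key within the database
        auth_name TEXT,                 -- authority name, e.g. EPSG
        auth_srid INTEGER,              -- identifier assigned by that authority
        srtext    TEXT                  -- Well-known Text description of the CRS
    )
    """
)
# Placeholder text stands in for the actual WKT of the reference system.
conn.execute(
    "INSERT INTO spatial_ref_sys VALUES (?, ?, ?, ?)",
    (4326, "EPSG", 4326, "PLACEHOLDER: WKT for EPSG:4326"),
)


def srtext_for(connection, srid: int):
    """Look up the WKT text for an SRID, or None if it is not registered."""
    row = connection.execute(
        "SELECT srtext FROM spatial_ref_sys WHERE srid = ?", (srid,)
    ).fetchone()
    return row[0] if row else None


known = srtext_for(conn, 4326)
missing = srtext_for(conn, 999999)
```

A geometry column then only needs to carry the integer SRID, and users can register their own reference systems by inserting rows.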

I think this means that core iceberg should not need to know PROJJSON. I do still believe PROJJSON is the best choice for GeoParquet and Parquet, and we can continue to work together to figure out the best approach there so the entire ecosystem works well.

@jiayuasu
Member

jiayuasu commented Aug 1, 2024

Thank you all for the great discussion. We will focus on the Parquet geometry proposal for now, then come back to the Iceberg one.

As I already commented in the Parquet Geometry proposal, according to the comment above, my suggestion is

In the Parquet Geometry PR, we add a string field namely crs_kind in addition to the crs field. The only allowed value currently is PROJJSON. In the future, if there is a new OGC standard called CRSJSON that differs from PROJJSON, we will allow another value CRSJSON.
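
A minimal sketch of how the `crs` and `crs_kind` pair suggested above might be validated. The class name `GeometryCrs` and the set `ALLOWED_CRS_KINDS` are hypothetical and do not come from the Parquet proposal; only the field names and the single allowed value are taken from the comment.

```python
from dataclasses import dataclass

# Only PROJJSON is allowed for now; CRSJSON could be added here later.
ALLOWED_CRS_KINDS = frozenset({"PROJJSON"})


@dataclass(frozen=True)
class GeometryCrs:
    crs: str                    # the CRS definition itself, stored as a string
    crs_kind: str = "PROJJSON"  # tells the reader how to interpret `crs`

    def __post_init__(self):
        if self.crs_kind not in ALLOWED_CRS_KINDS:
            raise ValueError(f"unsupported crs_kind: {self.crs_kind!r}")


default_kind = GeometryCrs(crs="{}").crs_kind
```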

For WKT2 2019 <-> PROJJSON, we will implement Java/C++ versions of the rouault/projjson_to_wkt library, so that whoever wants to use a WKT2 2019 CRS can obtain it from the PROJJSON string.


7 participants