Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Add code to generalize data using different strategies #1830

Closed
wants to merge 1 commit into from

Conversation

joto
Copy link
Collaborator

@joto joto commented Nov 21, 2022

This commit adds the code to generalize polygons using two different strategies. In both cases polygons are merged and simplified on a tile-by-tile basis.

The "vector" strategy uses the commonly used approach where polygons are buffered, their union is calculated and then a reverse buffer applied, followed by a final buffer again.

The "raster" strategy does a similar thing but does it in raster space which is much faster. First the polygons are rendered into a raster, an open/close operation is called (which basically does the same thing as the buffering in vector space) and finally the resulting raster is vectorized again.

For the raster support this adds two new library dependency: CImg and potrace.

The functionality is accessed through a new command line program called osm2pgsql-gen. Call it with -h to get some usage information. This program is for testing only, eventually the functionality should be accessible from osm2pgsql itself (using the Lua config file for configuration).

Some additional information is in README-gen.md

@pnorman
Copy link
Collaborator

pnorman commented Nov 21, 2022

Is the main goal here to verify that the algorithms produce useful results before implementing in osm2pgsql?

@joto
Copy link
Collaborator Author

joto commented Nov 21, 2022

This is part of the generalization effort (https://osm2pgsql.org/generalization/).

For some context see

Is the main goal here to verify that the algorithms produce useful results before implementing in osm2pgsql?

Yes. And to see how performant it is. It doesn't help us if it can't be used in real-world scenarios.

@mboeringa
Copy link

mboeringa commented Nov 27, 2022

@joto

Interesting reads those blog posts.

A couple of remarks though:

  • Unless you use ST_Buffer's 'num_seg_quarter_circle' setting in the vector option, you may in fact end up with far more vertices in the output than in the input polygons, and not be generalizing the data at all... Using 'num_seg_quarter_circle' with a low value like 2 or 3, should considerably speed up the (subsequent) processing as well.
  • It may be highly recommended to add an additional true generalization step using 'ST_Simplify/ST_SimplifyPreserveTopology/ST_SimplifyVW' after the buffer operation, for the same reason as listed above: truly reduce the number of vertices and processing requirements.
  • There is a high chance of producing invalid output with Polygon generalization. In fact, when I ran the QGIS's 'Check Validity' option two years ago or so against the generalized land and water polygons as created by the coastline processing you refer to and as available on https://osmdata.openstreetmap.de/data/, QGIS detected multiple errors. I don't know the current situation, I should re-check it, but from my own experience developing my own algorithms, it is easy to run into trouble with invalid Polygons (and it can be really hard to fix it and get 100% valid output). In my particular case, running 'ST_MakeValid' was not enough to deal with all issues (but ArcGIS is even more finicky in this respect due to the different Polygon validity framework, although any OGC valid one is also valid in the ESRI world according to https://pro.arcgis.com/en/pro-app/latest/tool-reference/data-management/check-geometry.htm).
  • Anyway, since the goal of osm2pgsql was to always produce valid output, I would certainly recommend carefully reviewing any generalization results from the developed algorithms for issues, as invalid geometries can cause a range of issues for subsequent use.

@mboeringa
Copy link

A few more remarks:

  • I also recommend testing any generalization algorithm at a global scale. Small regional extracts may not reveal all issues, especially related to the anti-meridian, and how some PostGIS commands deal with that.
  • Mind the "fine print" with any of the 'ST_Simplify/ST_SimplifyPreserveTopology/ST_SimplifyVW' PostGIS commands as listed in the PostGIS documentation.
  • Output from the current, non-generalized, osm2pgsql processing pipeline, has shown to be very robust in both QGIS and ArcGIS for me, with fully valid readable output, even with the more advanced 'flex' processing.

@joto
Copy link
Collaborator Author

joto commented Dec 5, 2022

You are raising some valid points, @mboeringa. In fact the first version of the code in this PR would generate polygons that were all invalid! (The rings were not closed, which is fixed now.) I also added an optional ST_MakeValid() step to the raster-union strategy which seems to work well in my experiments (use -p make_valid to enable) and seems to not take too much time. But overall I am not too worried about validity. The generalized data is not that accurate anyway, that's the whole point of generalization, and small rendering errors will probably not matter. Still, it is better to have things valid. We'll need some more real-world experience with this to see how important that is and whether it is worth the extra effort.

Unfortunately the potrace library will sometimes generate invalid polygons. I have seen two cases: Sometimes it snips off corners of the tiles it renders, this can be mitigated by having larger margins around tiles, but that's certainly something to look out for. The other happens when rendering polygons touching in single points (think of the figure "8") which must be modelled as two polygons, not one. There is another problem with the potrace library which I couldn't figure out yet: The left side of all polygons is slightly to the left of where it should be, the right side is okay. Maybe some kind of rounding error. You can't usually see it in the results though, because the effect is small, but still somewhat annoying.

You also mentioned the num_seg_quarter_circle setting and the need for simplification. I have played around with this in a different context, but didn't do that in this project. Using fewer segments for the buffer makes processing noticably faster, so that is an important optimization, but I am focusing on the new raster-based approach which is even faster and it generates far smaller polygons (with less vertexes) to begin with, so I believe there is no need for the extra simplification step. But it would be interesting to try this out. There is a huge experimental space here and I fully expect that we'll tweak code and parameters for a long time to come to better results. Anybody who wants to play around with this, please go ahead and report your findings here.

@pnorman
Copy link
Collaborator

pnorman commented Dec 5, 2022

The generalized data is not that accurate anyway, that's the whole point of generalization, and small rendering errors will probably not matter. Still, it is better to have things valid

Having a guarantee of validity is very important since lots of PostGIS functions require validity.

README-gen.md Outdated
* `-w, --width=WIDTH`: Size of the imaged rendered when using the `raster-union`
strategy, not used in the `vector-union` strategy.
* `-b, --buffer=BUFFER`: The amount by which the polygons will be buffered. For
the `raster-union` strategy this is in pixels, for the `vector-union` strategy
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation doesn't say what resolution is used per tile

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, sorry, there is a lot the documentation doesn't spell out clearly. That's because this is still work-in-progress. You have to do a bit of experimenting if you want to try this.

Copy link
Collaborator

@pnorman pnorman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few things I noticed while getting it running. Not a complete review.

  • There's a bunch of hard-coded 3857 assumptions. If we're okay with these, we need to document them.

connection->query(
PGRES_COMMAND_OK,
"PREPARE get_extent AS WITH ex AS "
"(SELECT ST_EstimatedExtent('{}', 'geom') AS e) "
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This hard-codes the geometry column to geom

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixes in new version

xmax = 0;
ymax = 0;
} else {
if (x == 0 && y == 0 && xmax == 0 && ymax == 0) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Zero shouldn't be treated as a special value, because you might just want to generate the 0/0 tile.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in new version

README-gen.md Outdated
* `-b, --buffer=BUFFER`: The amount by which the polygons will be buffered. For
the `raster-union` strategy this is in pixels, for the `vector-union` strategy
this is in Mercator map units.
* `-g, --group-by-column=COL`: Set this to the column describing the type of
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are multiple columns indicated?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not possible. If that's really something we need, we can add that later. But I am trying to keep the config to a minimum.

@joto
Copy link
Collaborator Author

joto commented Dec 13, 2022

New version documents that this only works for 3857. Also added a "make_valid" option to ensure results are valid polygons. Lots of bug fixes. And other stuff... I'll get around to documenting this eventually...

README-gen.md Outdated
called `type` with the type of the landcover (forest, grass, etc.).

Create a table `landcover_z10` with (at least) columns `geom` and `type` as
well as integer columns `x`, and `y`. Then call this:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An index on x, y is also needed

Copy link
Collaborator Author

@joto joto Dec 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you use osm2pgsql-gen on larger extracts or the planet you need to run it with -D on first import to suppress DELETEs, then built indexes on x, y and when you do updates run without the -D. The problem is that by default osm2pgsql-gen will DELETE all features in a tile in the output table before loading them. But when you start that for the first time, i.e. when the PREPARE is done, the table is empty, so even if there is an index PostgreSQL will decide not to use it. That's fine in the beginning, but as the table grows everything becomes slower and slow.

After the import you create the index and do an ANALYZE and from then on everything is fine when doing updates.

This is a bit tricky, but once we put this functionality into osm2pgsql itself, we know whether we are in import or append mode and can do the right thing automatically.

@pnorman
Copy link
Collaborator

pnorman commented Dec 16, 2022

I've been experimenting with the results now, and am having an error where the lower-left corner of each tile's polygon is being clipped away

I created the base table with

CREATE MATERIALIZED VIEW
osm_planet_trees
AS
SELECT
    osm_id,
    way,
    name,
    ref,
    way_area
FROM planet_osm_polygon
WHERE "natural" = 'wood' OR landuse = 'forest'

then ran
PGDATABASE=flex time ~/osm/osm2pgsql/build/osm2pgsql-gen -t planet_osm_trees -T trees_z10 -g way -G name -s raster-union -x0 -X1024 -y0 -Y1024 -z10 -m 0.125

The result in qgis looks like this:

image

@joto
Copy link
Collaborator Author

joto commented Dec 16, 2022

I've been experimenting with the results now, and am having an error where the lower-left corner of each tile's polygon is being clipped away

I believe that's already been fixed in the current version of this PR. Did you use that?

@joto joto force-pushed the wip-gen-tile branch 4 times, most recently from dd1c8f3 to 7891280 Compare December 22, 2022 15:30
@danieldegroot2

This comment was marked as off-topic.

@joto joto changed the title WIP: Add code to generalize polygons using two different strategies WIP: Add code to generalize data using different strategies Dec 28, 2022
@joto joto force-pushed the wip-gen-tile branch 3 times, most recently from e56cb0e to f7430aa Compare January 31, 2023 18:19
@joto joto force-pushed the wip-gen-tile branch 5 times, most recently from 285255d to 0613222 Compare February 25, 2023 11:11
This commit adds the code to generalize various types of data using
different strategies.

The following strategies work on a tile-by-tile basis and operate on
polygons:

The "vector-union" strategy buffers and unionizes polygons using vector
operations.

The "raster-union" strategy does a similar thing but does it in raster
space which is much faster. First the polygons are rendered into a
raster, an open/close operation is called (which basically does the same
thing as the buffering in vector space) and finally the resulting raster
is vectorized again.

The "builtup" strategy is intended to derive a layer of builtup areas
from landuse=residential/industrial etc. as well as building cover and
dense road networks. This still needs some work...

Also a new "discrete-isolation" strategy which rates places based on
some importance metric. (This is not tile-based.)

The new "rivers" strategy finds important rivers, this is still very
much work in progress.

For the raster support this adds two new library dependency: CImg and
potrace.

The functionality is accessed through a new command line program called
osm2pgsql-gen. It reads the same Lua config file that osm2pgsql reads.
Call it with -h to get some usage information. This program is for
testing only, eventually the functionality should be accessible from
osm2pgsql itself.

See also https://osm2pgsql.org/generalization/ .
@joto
Copy link
Collaborator Author

joto commented Mar 2, 2023

Closing in favour of #1942.

@joto joto closed this Mar 2, 2023
@joto joto deleted the wip-gen-tile branch March 6, 2023 08:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants