Middle: Get dependencies with single queries #1996

Merged on Jul 14, 2023 (3 commits)

@joto (Collaborator) commented Jul 13, 2023

In append mode osm2pgsql has to find all parent ways and relations of changed nodes and all parent relations of changed ways. Until now this was done with one query per object read from the input file, which is quite wasteful. This PR changes the code to do only a single query for all dependencies. The list of all changed ids is COPYied into a temporary table. Then the parent object ids are found, written into another temporary table, and retrieved from there. (We could do this without the extra temporary tables, but this could allow us later to use the content of those tables somewhere else, so I chose to do it this way. Should make a big difference performance-wise.)
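
The three steps described above (load the changed ids into a temporary table, fill a second temporary table with the parent ids using one set-based query, read the result back) can be sketched roughly as follows. This is only an illustration, not the osm2pgsql code: it uses SQLite from Python as a stand-in for PostgreSQL, `executemany` stands in for COPY, and the table names `way_nodes`, `changed_nodes` and `parent_ways` as well as the one-row-per-member schema are assumptions made for the example.

```python
import sqlite3


def find_parent_ways(conn, changed_node_ids):
    """Return the ids of all ways referencing any changed node, using one query."""
    cur = conn.cursor()
    # Step 1: bulk-load the changed ids into a temporary table.
    # (Hypothetical table name; on PostgreSQL this would be a COPY.)
    cur.execute("CREATE TEMP TABLE changed_nodes (id INTEGER PRIMARY KEY)")
    cur.executemany(
        "INSERT OR IGNORE INTO changed_nodes (id) VALUES (?)",
        [(node_id,) for node_id in changed_node_ids],
    )
    # Step 2: a single set-based query writes all parent ids
    # into a second temporary table.
    cur.execute(
        """
        CREATE TEMP TABLE parent_ways AS
        SELECT DISTINCT wn.way_id AS id
        FROM way_nodes wn
        JOIN changed_nodes c ON c.id = wn.node_id
        """
    )
    # Step 3: retrieve the parent ids.
    parents = [row[0] for row in cur.execute("SELECT id FROM parent_ways ORDER BY id")]
    # Clean up so the function can be called again on the same connection.
    cur.execute("DROP TABLE changed_nodes")
    cur.execute("DROP TABLE parent_ways")
    return parents


# Toy data: (way_id, node_id) membership pairs in an assumed schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE way_nodes (way_id INTEGER, node_id INTEGER)")
conn.executemany(
    "INSERT INTO way_nodes VALUES (?, ?)",
    [(10, 1), (10, 2), (11, 2), (11, 3), (12, 4)],
)
result = find_parent_ways(conn, [2, 4])
print(result)
```

The number of queries stays constant no matter how many ids are in the input, which is where the gain for large diffs would come from.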

This PR also fixes #423, the longstanding problem that ways and relations that are in the input file AND are parents of changed objects were processed twice.

The work is split between three commits, one each for nodes, ways, and relations processing.

I did some performance measurements; in every one the new code was more efficient. For smaller (minutely) diffs the difference is small: about a 30% performance improvement, measured over roughly 2000 diffs. At the other end of the spectrum the performance gain is huge: updating a planet file with a diff containing about two weeks of data was an order of magnitude faster (from >10h down to an hour).

joto added 3 commits July 13, 2023 17:10
For each changed node osm2pgsql has to find all ways and all relations
referencing that node and make sure they are updated as well. This was
done by doing one SQL query per node which is quite inefficient. Instead
the new code collects all node ids and does only one query for the
parent ways and one for the parent relations after all nodes have been
read.

This implementation uses temporary tables. The changed nodes are written
to a temp table and from that, further temp tables are filled with the
parent ways and relations. Strictly speaking the latter tables are not necessary,
we could read the data directly from the query. But maybe those tables
could be useful in the future for some additional processing in the
database. (They'd need to be converted to non-temp tables then, though.)

For each changed way osm2pgsql has to find all relations referencing
that way and make sure they are updated as well. This was done by doing
one SQL query per way which is quite inefficient. Instead the new code
collects all way ids and does only one query for the parent relations
after all ways have been read.

This commit also makes sure that ways which have been in the input file
are removed from the pending way tracker so that they are not processed
again later.

In append mode relations could be processed twice if they are in the
input and also the parent of a node or way in the input. With this
change we are removing all ids of changed relations from the list of
pending relations before processing them.

Fixes osm2pgsql-dev#423.
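
The de-duplication described in the commit messages above amounts to a set difference: ids already processed from the input are dropped from the list of pending parents before that list is processed. A minimal sketch, assuming plain Python lists of ids; the function name `remove_already_processed` and the sample ids are hypothetical and not taken from the osm2pgsql code:

```python
def remove_already_processed(pending_ids, input_ids):
    """Return the pending parent ids, without those already handled from the input."""
    # Duplicates among the pending ids collapse as well, since a set is used.
    return sorted(set(pending_ids) - set(input_ids))


# Relations 100 and 101 are in the input file; 101 is also the parent
# of a changed way, so without the fix it would be processed twice.
input_relations = [100, 101]
parents_of_changed = [101, 102, 102, 103]
pending = remove_already_processed(parents_of_changed, input_relations)
print(pending)
```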
@mboeringa

Should make a big difference performance-wise.

I guess you meant to write "Shouldn't make a big difference performance-wise"?

@lonvia merged commit fb6bd80 into osm2pgsql-dev:master Jul 14, 2023
27 checks passed
@joto deleted the single-query-dependencies branch July 14, 2023 12:56
Successfully merging this pull request may close these issues.

Reprocessing relations multiple times
3 participants