Middle: Get dependencies with single queries #1996
Merged
In append mode osm2pgsql has to find all parent ways and relations of changed nodes, and all parent relations of changed ways. This is currently done with one query per object read from the input file, which is quite wasteful. This PR changes the code to do only a single query for all dependencies. The list of all changed ids is COPYied into a temporary table. Then the ids of the parent objects are found, written into another temporary table, and retrieved from there. (We could do this without the extra temporary tables, but keeping them could allow us to use their content somewhere else later, so I chose to do it this way.) This should make a big difference performance-wise.
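As a rough illustration of the batching idea described above (not the actual osm2pgsql code, which is C++ talking to PostgreSQL), here is a minimal Python sketch. It makes several assumptions: it uses SQLite instead of PostgreSQL, plain INSERTs stand in for COPY, and the `way_nodes` table, the temporary table names `changed_nodes` and `parent_ways`, and the function names are all hypothetical.

```python
import sqlite3


def find_parent_ways(conn, changed_node_ids):
    """Find all ways referencing any changed node with one lookup query.

    Hypothetical sketch: table and column names are assumptions, and
    INSERT is used where the PR uses PostgreSQL's COPY.
    """
    cur = conn.cursor()

    # Step 1: load all changed node ids into a temporary table.
    cur.execute(
        "CREATE TEMP TABLE IF NOT EXISTS changed_nodes (id INTEGER PRIMARY KEY)"
    )
    cur.execute("DELETE FROM changed_nodes")
    cur.executemany(
        "INSERT OR IGNORE INTO changed_nodes (id) VALUES (?)",
        ((node_id,) for node_id in changed_node_ids),
    )

    # Step 2: a single query finds every parent way and stores its id
    # in a second temporary table, which could be reused elsewhere.
    cur.execute(
        "CREATE TEMP TABLE IF NOT EXISTS parent_ways (id INTEGER PRIMARY KEY)"
    )
    cur.execute("DELETE FROM parent_ways")
    cur.execute(
        "INSERT INTO parent_ways (id) "
        "SELECT DISTINCT w.way_id FROM way_nodes w "
        "JOIN changed_nodes c ON c.id = w.node_id"
    )

    # Step 3: retrieve the collected parent ids.
    return [row[0] for row in cur.execute("SELECT id FROM parent_ways ORDER BY id")]


def demo():
    conn = sqlite3.connect(":memory:")
    # Simplified, assumed schema: one row per (way, member node) pair.
    conn.execute("CREATE TABLE way_nodes (way_id INTEGER, node_id INTEGER)")
    conn.executemany(
        "INSERT INTO way_nodes VALUES (?, ?)",
        [(10, 1), (10, 2), (11, 2), (11, 3), (12, 4)],
    )
    # Nodes 2 and 3 changed; ways 10 and 11 reference them.
    return find_parent_ways(conn, [2, 3])


if __name__ == "__main__":
    print(demo())
```

The point of the sketch is that the number of queries stays constant no matter how many objects changed, instead of growing with the size of the input file.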
This PR also fixes #423, the longstanding problem that ways and relations that are in the input file AND are parents of changed objects were processed twice.
The work is split into three commits, one each for node, way, and relation processing.
I did some performance measurements, and in every one of them the new code was more efficient. For smaller (minutely) diffs the difference is small: about a 30% performance improvement, measured over roughly 2000 diffs. At the other end of the spectrum the performance gain is huge. Updating a planet file with a diff containing about two weeks of data was an order of magnitude faster (from >10h down to about an hour).