Middle: Get dependencies with single queries #1996

joto · 2023-07-13T15:24:16Z

In append mode osm2pgsql has to find all parent ways and relations of changed nodes and parent relations of changed ways. This is done with one query per object read from the input file, which is quite wasteful. This PR changes the code to do only a single query for all dependencies. The list of all changed ids is COPYied into a temporary table. Then the ids of the parent objects are found and written into another temporary table and then retrieved. (We could do this without the extra temporary tables, but this could allow us later to use the content of those tables somewhere else, so I chose to do it this way. Should make a big difference perfomance-wise.)

This PR also fixes #423, the longstanding problem that ways and relations that are in the input file AND parents of changed objects were processed twice.

The work is split between three commits, one each for nodes, ways, and relations processing.

I did some performance measurements; in every one the new code was more efficient. For smaller (minutely) diffs the difference is small, about a 30% performance improvement measured over 2000-odd diffs. At the other end of the spectrum the performance gain is huge: updating a planet file with a diff containing about two weeks of data was an order of magnitude faster (from >10h down to an hour).

For each changed node osm2pgsql has to find all ways and all relations referencing that node and make sure they are updated as well. This was done by doing one SQL query per node, which is quite inefficient. Instead the new code collects all node ids and does only one query for the parent ways and one for the parent relations after all nodes have been read. This implementation uses temporary tables. The changed nodes are written to a temp table, and from that further temp tables are filled with the parent ways and relations. Strictly speaking the latter tables are not necessary; we could read the data directly from the query. But maybe those tables could be useful in the future for some additional processing in the database. (They'd need to be converted to non-temp tables then, though.)
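To illustrate the batching idea described above, here is a minimal sketch in Python. It is not the osm2pgsql implementation: it uses an in-memory SQLite database as a stand-in for PostgreSQL, plain INSERTs instead of COPY, and hypothetical table names (`way_nodes`, `rel_members`, `changed_nodes`, `parent_ways`, `parent_rels`) that do not reflect the real middle schema.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with hypothetical membership tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE way_nodes (way_id INTEGER, node_id INTEGER)")
    conn.execute(
        "CREATE TABLE rel_members (rel_id INTEGER, member_type TEXT, member_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO way_nodes VALUES (?, ?)",
        [(10, 1), (10, 2), (11, 2), (12, 3)],
    )
    conn.executemany(
        "INSERT INTO rel_members VALUES (?, ?, ?)",
        [(100, "n", 1), (101, "w", 1), (102, "n", 3)],
    )
    return conn


def find_parents_of_nodes(conn, changed_node_ids):
    """Find parent ways and relations of all changed nodes with one query each."""
    cur = conn.cursor()
    # Step 1: write all changed node ids into a temporary table in one batch.
    cur.execute("CREATE TEMP TABLE changed_nodes (id INTEGER PRIMARY KEY)")
    cur.executemany(
        "INSERT INTO changed_nodes (id) VALUES (?)",
        ((node_id,) for node_id in set(changed_node_ids)),
    )
    # Step 2: fill one temporary table per parent type with a single join.
    cur.execute(
        "CREATE TEMP TABLE parent_ways AS "
        "SELECT DISTINCT w.way_id AS id FROM way_nodes w "
        "JOIN changed_nodes c ON c.id = w.node_id"
    )
    cur.execute(
        "CREATE TEMP TABLE parent_rels AS "
        "SELECT DISTINCT m.rel_id AS id FROM rel_members m "
        "JOIN changed_nodes c ON c.id = m.member_id "
        "WHERE m.member_type = 'n'"
    )
    # Step 3: retrieve the parent ids from the temporary tables.
    ways = [row[0] for row in cur.execute("SELECT id FROM parent_ways ORDER BY id")]
    rels = [row[0] for row in cur.execute("SELECT id FROM parent_rels ORDER BY id")]
    # Clean up so the function can be called again on the same connection.
    for table in ("changed_nodes", "parent_ways", "parent_rels"):
        cur.execute(f"DROP TABLE {table}")
    return ways, rels
```

The number of queries stays constant no matter how many nodes changed, which is where the gain over one query per node comes from.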

For each changed way osm2pgsql has to find all relations referencing that way and make sure they are updated as well. This was done by doing one SQL query per way which is quite inefficient. Instead the new code collects all way ids and does only one query for the parent relations after all ways have been read. This commit also makes sure that ways which have been in the input file are removed from the pending way tracker so that they are not processed again later.

In append mode relations could be processed twice if they are in the input and also the parent of a node or way in the input. With this change we are removing all ids of changed relations from the list of pending relations before processing them. Fixes osm2pgsql-dev#423.
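The deduplication step can be pictured as a set difference. The following is only a sketch under that assumption, with a hypothetical helper name; the actual pending tracker in osm2pgsql is a C++ data structure and may work differently.

```python
def remove_already_processed(pending_ids, ids_in_input):
    """Return the pending ids, sorted, without those already read from the input.

    Objects that appear in the input file are processed when they are read,
    so they must not be processed a second time as pending parents.
    """
    return sorted(set(pending_ids) - set(ids_in_input))
```

For example, if relations 3, 5 and 7 are pending as parents and relation 3 was also in the input file, only 5 and 7 remain to be processed.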

mboeringa · 2023-07-13T17:52:04Z

Should make a big difference perfomance-wise.

I guess you meant to write "Shouldn't make a big difference performance-wise"?

joto added 3 commits July 13, 2023 17:10

lonvia merged commit fb6bd80 into osm2pgsql-dev:master Jul 14, 2023
27 checks passed

joto deleted the single-query-dependencies branch July 14, 2023 12:56
