fixes 707: use sql.unnest instead of sql.join for bulk insert #771

sadiqkhoja · 2023-02-15T21:42:21Z

Closes #707

What has been done to verify that this works as intended?

Unit tests, integration tests, and manual testing with a form containing 25K questions.

Why is this the best possible solution? Were any other approaches considered?

PostgreSQL limits the number of parameters that can be passed in a single query. For inserting a large number of rows, it provides the unnest function. The alternative is to break the rows into batches, which is not performant and would require a lot of code changes.
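To make the parameter-count argument concrete, here is a minimal sketch comparing how many bind parameters each style needs. The helper names are hypothetical and are not part of this PR; the limit itself is commonly cited as 65,535, since the wire protocol counts bind parameters in a 16-bit field.

```javascript
// Hypothetical helpers, for illustration only.
// rows: array of tuples, one value per column in every tuple.

// A multi-row VALUES insert needs one placeholder per value.
const countValuesPlaceholders = (rows) =>
  rows.reduce((total, row) => total + row.length, 0);

// An unnest-based insert needs one placeholder per column.
const countUnnestPlaceholders = (rows) =>
  (rows.length === 0 ? 0 : rows[0].length);

// Build the SQL text of a multi-row VALUES insert: ($1, $2), ($3, $4), ...
const valuesInsertText = (table, columns, rowCount) => {
  const width = columns.length;
  const tuples = [];
  for (let r = 0; r < rowCount; r += 1) {
    const slots = columns.map((_, c) => `$${(r * width) + c + 1}`);
    tuples.push(`(${slots.join(', ')})`);
  }
  const cols = columns.map((c) => `"${c}"`).join(', ');
  return `INSERT INTO "${table}" (${cols}) VALUES ${tuples.join(', ')}`;
};
```

With 25K rows of 10 columns, the first counter gives 250,000 and the second gives 10, which is why the unnest form stays well under any parameter limit.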

How does this change affect users? Describe intentional changes to behavior and behavior that could have accidentally been affected by code changes. In other words, what are the regression risks?

Test cases related to multiselect questions, client audits, form attachments, and submission attachments should be executed.

Does this change require updates to the API documentation? If so, please update docs/api.md as part of this PR.

no

Before submitting this PR, please make sure you have:

run make test-full and confirmed all checks still pass OR confirm CircleCI build passes
verified that any code from external sources is properly credited in comments

sadiqkhoja · 2023-02-15T21:50:24Z

lib/model/frame.js

+const fieldTypes = (types) => (def) => {
+  def.fieldTypes = types;
+};
+


Being pragmatic here: I don't want to change a whole lot of things. With this function, we can define fieldTypes only for those frames that are bulk inserted using the insertMany function.

The alternative was to change def.fields from an array of strings to an array of objects with name, type, readable, writable, etc., or to have an associated array of types and define a type for every field in every frame.

We can remove this if and when gajus/slonik#455 is implemented.
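As a rough illustration of how a definer like this might be applied, here is a minimal sketch. Only `fieldTypes` is taken from the diff; `fields`, `defineFrame`, and the `Widget` frame with its columns are hypothetical stand-ins and do not reflect the project's actual frame machinery.

```javascript
// Copied from the diff above.
const fieldTypes = (types) => (def) => {
  def.fieldTypes = types;
};

// Hypothetical definer that lists a frame's field names.
const fields = (...names) => (def) => {
  def.fields = names;
};

// Hypothetical frame builder: run each definer against a fresh definition.
const defineFrame = (...definers) => {
  const def = { fields: [], fieldTypes: null };
  for (const definer of definers) definer(def);
  return def;
};

// Only frames that are bulk inserted need to declare types.
const Widget = defineFrame(
  fields('label', 'weight'),
  fieldTypes(['text', 'int4'])
);

// A frame that is never bulk inserted can simply omit fieldTypes.
const Gadget = defineFrame(fields('label'));
```

The appeal of this shape is that existing frame definitions stay untouched; the types are an opt-in addition.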

sadiqkhoja · 2023-02-15T21:52:17Z

lib/util/db.js

+
+  // we need to set clock_timestamp if there's createdAt column
+  // Slonik doesn't support setting sql identifier for sql.unnest yet
+  if (Type.hasCreatedAt) {


This special treatment of the createdAt field is due to an existing unit test; we don't have any frame with createdAt that we bulk insert today. It is nice to have for future use cases.
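One way such a createdAt branch could look is sketched below as a plain string builder. This is an assumption about the general shape, not the code in this PR: `bulkInsertText` and its argument names are hypothetical, and `fields` is assumed to exclude createdAt.

```javascript
// Hypothetical builder: when the frame has a createdAt column, fill it
// from clock_timestamp() instead of passing it as a bind parameter.
const bulkInsertText = ({ table, fields, fieldTypes, hasCreatedAt }) => {
  const quote = (name) => `"${name}"`;
  // One typed array placeholder per (non-createdAt) column.
  const args = fieldTypes.map((type, i) => `$${i + 1}::${type}[]`).join(', ');
  const columns = hasCreatedAt
    ? ['createdAt', ...fields].map(quote).join(', ')
    : fields.map(quote).join(', ');
  const select = hasCreatedAt ? 'clock_timestamp(), t.*' : 't.*';
  return `INSERT INTO ${quote(table)} (${columns}) SELECT ${select} FROM unnest(${args}) AS t`;
};
```

Putting createdAt first in both the column list and the select list keeps the remaining columns aligned with the unnest output.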

ktuite

Writing out my understanding after an interactive code review with @sadiqkhoja:

We use a function called insertMany to do bulk inserting of certain things including Form Fields, Submission Attachments, Comments, Audits, and Client Audits.

insertMany was using sql.join internally, which broke down in extreme cases such as a very large form with 25K fields. In such a form, we store ~10 columns about each form field, so sql.join was forming a query with 250,000 placeholders, which was too many (exceeding some stack size in Node). Sadiq pinpointed the problematic line in slonik (in sql.join), where a spread operator was failing on those 250K values. Rather than trying to change slonik, he found a different slonik function, sql.unnest, to use instead.
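The spread failure mode can be reproduced in isolation. The sketch below is not slonik's code; the function names are hypothetical, and whether the spread actually throws depends on the JavaScript engine and its stack size.

```javascript
// Spreading passes every element as a separate call argument, so a very
// large array may exceed the engine's argument/stack limits.
const pushWithSpread = (target, source) => {
  try {
    target.push(...source);
    return true;
  } catch (err) {
    // Engines typically report this as a RangeError.
    if (err instanceof RangeError) return false;
    throw err;
  }
};

// A plain loop never builds a giant argument list.
const pushWithLoop = (target, source) => {
  for (const item of source) target.push(item);
  return target;
};

const values = new Array(250000).fill(0);
console.log('spread succeeded:', pushWithSpread([], values));
console.log('loop length:', pushWithLoop([], values).length);
```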

sql.unnest works by passing arrays as placeholder values: e.g., when inserting 25K form fields of 10 columns each, the query has 10 placeholders, each holding a 25K-length array of every value for that one column. The one caveat of using this function is that it takes a second required argument listing the type of each column.
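The column-wise layout described above can be sketched without slonik. The helpers below are hypothetical and only emulate the idea; as far as I know, slonik's own call has the shape sql.unnest(tuples, columnTypes), but the exact query text it generates is not reproduced here.

```javascript
// Transpose row tuples into one array per column.
const toColumns = (rows, width) => {
  const columns = Array.from({ length: width }, () => []);
  for (const row of rows) {
    row.forEach((value, i) => columns[i].push(value));
  }
  return columns;
};

// Build an unnest-based insert: one typed array placeholder per column.
const unnestInsert = (table, fields, fieldTypes, rows) => {
  const cols = fields.map((f) => `"${f}"`).join(', ');
  const args = fieldTypes.map((type, i) => `$${i + 1}::${type}[]`).join(', ');
  return {
    text: `INSERT INTO "${table}" (${cols}) SELECT * FROM unnest(${args})`,
    values: toColumns(rows, fields.length),
  };
};

// Two rows of two columns become two placeholders, each a 2-element array.
const query = unnestInsert(
  'things',
  ['path', 'position'],
  ['text', 'int4'],
  [['/a', 0], ['/b', 1]]
);
console.log(query.text);
```

The number of placeholders depends only on the column count, so the row count can grow without changing the query text.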

Most of the code in this PR is about modifying our frames to also define field types on any frame that needs to use insertMany. If field types are missing and insertMany is called anyway, there is a 500 Problem explaining that.
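A guard of this kind might look roughly like the following. The error class and function names are hypothetical stand-ins; the project's real Problem type and its message are not shown in this conversation.

```javascript
// Hypothetical error standing in for the project's 500 Problem.
class MissingFieldTypesError extends Error {
  constructor(frameName) {
    super(`fieldTypes are not defined on ${frameName}, but insertMany requires them`);
    this.name = 'MissingFieldTypesError';
    this.httpCode = 500;
  }
}

// Fail early, before any SQL is built, if the frame lacks fieldTypes.
const requireFieldTypes = (Type) => {
  if (!Array.isArray(Type.fieldTypes)) {
    throw new MissingFieldTypesError(Type.name);
  }
  return Type.fieldTypes;
};
```

Failing before the query is assembled gives the developer a message naming the frame, rather than a less obvious database error.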

To summarize

insertMany changed to use sql.unnest instead of sql.join.
Frames that use insertMany must now also specify a fieldTypes array for the properties in that frame.

alxndrsn · 2023-02-21T09:05:11Z

Another approach to this could be using json_to_recordset()/jsonb_to_recordset() - https://www.postgresql.org/docs/current/functions-json.html.

I think this could allow for a query with a single placeholder.
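A minimal sketch of what the single-placeholder variant could look like is below. The builder and its names are hypothetical; it relies on json_to_recordset requiring a column definition list in the AS clause.

```javascript
// Hypothetical builder: all rows travel as one JSON document in $1.
const recordsetInsert = (table, fields, fieldTypes, rows) => {
  const quote = (name) => `"${name}"`;
  const cols = fields.map(quote).join(', ');
  // Column definition list, e.g. "a" int4, "b" text
  const defs = fields.map((f, i) => `${quote(f)} ${fieldTypes[i]}`).join(', ');
  // Turn each tuple into an object keyed by field name.
  const records = rows.map((row) =>
    Object.fromEntries(fields.map((f, i) => [f, row[i]])));
  return {
    text: `INSERT INTO ${quote(table)} (${cols}) SELECT ${cols} FROM json_to_recordset($1::json) AS x(${defs})`,
    values: [JSON.stringify(records)],
  };
};
```

Column types are still needed here, so this approach would not remove the need to declare them per frame; it only reduces the placeholder count from one per column to one per query.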

fixes 707: use sql.unnest instead of sql.join for bulk insert

49aec88

sadiqkhoja marked this pull request as ready for review February 15, 2023 21:52

sadiqkhoja requested a review from ktuite February 15, 2023 21:52

ktuite approved these changes Feb 16, 2023

sadiqkhoja merged commit 222b2a8 into getodk:master Feb 17, 2023
