Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fixes 707: use sql.unnest instead of sql.join for bulk insert #771

Merged
merged 1 commit into from
Feb 17, 2023

Conversation

sadiqkhoja
Copy link
Contributor

@sadiqkhoja sadiqkhoja commented Feb 15, 2023

Closes #707

What has been done to verify that this works as intended?

Unit test, integration test and manually tested with the form with 25K questions.

Why is this the best possible solution? Were any other approaches considered?

Postgresql has a limit of parameters that can be passed in a query. To insert large number of rows it provides unnest function. Alternative is to break the rows into batches which is not performant and would require a lot of code changes.

How does this change affect users? Describe intentional changes to behavior and behavior that could have accidentally been affected by code changes. In other words, what are the regression risks?

Test cases related to multiselect question, client audits, form attachments and submission attachments should be executed.

Does this change require updates to the API documentation? If so, please update docs/api.md as part of this PR.

no

Before submitting this PR, please make sure you have:

  • run make test-full and confirmed all checks still pass OR confirm CircleCI build passes
  • verified that any code from external sources are properly credited in comments

Comment on lines +29 to +32
const fieldTypes = (types) => (def) => {
def.fieldTypes = types;
};

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Being pragmatic here, don't want to change a whole lot of things. With this function, we can define fieldTypes for only those frames that are bulk inserted using insertMany function.

Alternative was to change the def.fields from array of string to array of object with name, type, readable, writeable, etc or have associated array of types and define type for all fields in every frame.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can remove this when and if gajus/slonik#455 is implemented


// we need to set clock_timestamp if there's createdAt column
// Slonik doesn't support setting sql identitfier for sql.unnest yet
if (Type.hasCreatedAt) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this special treatment of createdAt field is due to an existing unit test, we don't have any frame with createdAt that we bulk insert today. This is nice to have for future use cases.

@sadiqkhoja sadiqkhoja marked this pull request as ready for review February 15, 2023 21:52
Copy link
Member

@ktuite ktuite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Writing out my understanding after an interactive code review with @sadiqkhoja:

We use a function called insertMany to do bulk inserting of certain things including Form Fields, Submission Attachments, Comments, Audits, and Client Audits.

insertMany was using sql.join internally, but broke down in extreme cases such as a very large form with 25K fields. In such a form, we store ~10 columns about each form field, so sql.join was forming a query with 250,000 placeholders that was too much (exceeding some stack size in node). Sadiq pinpointed the line in slonik (in sql.join) that was the problem, where there was a spread operator that was failing on those 250K values... but rather than trying to change slonik, he found a different sql.unnest slonik function to use instead.

sql.unnest works by taking arrays of placeholder values, e.g., when inserting 25K of 10-column form fields, it will have 10 placeholders where each is a 25K length array of every value for that one column. The one caveat of using this function is that it takes a second required argument to list out the types of each column.

Most of the code in this PR is about modifying our frames to also define field types on any frame that needs to use insertMany. If field types are missing and insertMany is called anyway, there is a 500 Problem explaining that.

To summarize

  • insertMany changed to use sql.unnest instead of sql.join.
  • Frames that use insertMany must now also specify a fieldTypes array for the properties in that frame.

@sadiqkhoja sadiqkhoja merged commit 222b2a8 into getodk:master Feb 17, 2023
@alxndrsn
Copy link
Contributor

alxndrsn commented Feb 21, 2023

Another approach to this could be using json_to_recordset()/jsonb_to_recordset() - https://www.postgresql.org/docs/current/functions-json.html.

I think this could allow for a query with a single placeholder.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Error not handled when uploading a big form
3 participants