fixes 707: use sql.unnest instead of sql.join for bulk insert #771
Conversation
```js
const fieldTypes = (types) => (def) => {
  def.fieldTypes = types;
};
```
Being pragmatic here; I don't want to change a whole lot of things. With this function, we can define `fieldTypes` for only those frames that are bulk inserted using the `insertMany` function.

The alternative was to change `def.fields` from an array of strings to an array of objects with name, type, readable, writeable, etc., or to have an associated array of types and define a type for all fields in every frame.
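A minimal sketch of how such a decorator could attach types to a frame definition, assuming frames are classes with a static `fields` array of strings (the `CommentDef` class and its field names here are illustrative, not the project's actual frames):

```javascript
// A higher-order function that stamps a parallel array of SQL types onto a
// frame definition, leaving def.fields as a plain array of strings.
const fieldTypes = (types) => (def) => {
  def.fieldTypes = types;
};

// A stand-in for a frame definition that only names its fields.
class CommentDef {}
CommentDef.fields = ['body', 'actorId', 'submissionId'];

// Apply the decorator to declare one type per field, positionally.
fieldTypes(['text', 'int4', 'int4'])(CommentDef);

console.log(CommentDef.fieldTypes); // ['text', 'int4', 'int4']
```

Because `fieldTypes` is applied per frame, only the frames that are actually bulk inserted need to carry the extra type metadata.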
We can remove this when and if gajus/slonik#455 is implemented.
```js
// we need to set clock_timestamp if there's a createdAt column;
// Slonik doesn't support setting a sql identifier for sql.unnest yet
if (Type.hasCreatedAt) {
```
This special treatment of the `createdAt` field is due to an existing unit test; we don't have any frame with `createdAt` that we bulk insert today. This is nice to have for future use cases.
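To illustrate the workaround, here is a hedged sketch (not the project's actual query builder) of how the `createdAt` column can be appended as a SQL expression in the `SELECT` list, since `sql.unnest` cannot carry an identifier or expression as one of its columns. Casting every placeholder to `text[]` is a simplification for the example:

```javascript
// Build the text of an unnest-based bulk INSERT. When the frame has a
// createdAt column, the INSERT selects clock_timestamp() alongside the
// unnested columns instead of trying to pass it through unnest itself.
const buildInsertText = (table, columns, hasCreatedAt) => {
  const cols = hasCreatedAt ? [...columns, 'createdAt'] : columns;
  const select = hasCreatedAt ? '*, clock_timestamp()' : '*';
  // one array placeholder per source column (types simplified to text[])
  const args = columns.map((_, i) => `$${i + 1}::text[]`).join(', ');
  return `INSERT INTO ${table} ("${cols.join('","')}") SELECT ${select} FROM unnest(${args})`;
};

console.log(buildInsertText('comments', ['body', 'actorId'], true));
// INSERT INTO comments ("body","actorId","createdAt") SELECT *, clock_timestamp() FROM unnest($1::text[], $2::text[])
```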
Writing out my understanding after an interactive code review with @sadiqkhoja:

We use a function called `insertMany` to do bulk inserting of certain things, including Form Fields, Submission Attachments, Comments, Audits, and Client Audits.

`insertMany` was using `sql.join` internally, but broke down in extreme cases such as a very large form with 25K fields. For such a form we store ~10 columns about each form field, so `sql.join` was forming a query with 250,000 placeholders, which was too many (exceeding a stack size limit in Node). Sadiq pinpointed the problem line in slonik (in `sql.join`), where a spread operator was failing on those 250K values... but rather than trying to change slonik, he found a different slonik function, `sql.unnest`, to use instead.

`sql.unnest` works by taking arrays of placeholder values: e.g., when inserting 25K rows of 10-column form fields, it will have 10 placeholders, where each is a 25K-length array of every value for that one column. The one caveat of using this function is that it takes a second required argument listing the type of each column.

Most of the code in this PR is about modifying our frames to also define field types on any frame that needs to use `insertMany`. If field types are missing and `insertMany` is called anyway, there is a 500 Problem explaining that.
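A small sketch of the row-to-column transposition that this approach relies on, using made-up form-field rows; the point is that the placeholder count becomes the column count, not rows × columns:

```javascript
// Turn an array of rows into an array of columns: one array per column,
// each holding every row's value for that column.
const transpose = (rows) =>
  rows.length === 0 ? [] : rows[0].map((_, col) => rows.map((row) => row[col]));

// Three illustrative 3-column form-field rows (name, type, order).
const rows = [
  ['q1', 'string', 1],
  ['q2', 'int', 2],
  ['q3', 'geopoint', 3],
];

const columns = transpose(rows);
console.log(columns);
// [['q1','q2','q3'], ['string','int','geopoint'], [1,2,3]]
// With 25,000 rows of 10 columns, this yields 10 array placeholders
// instead of 250,000 scalar ones.
```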
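A hedged sketch of the guard described above (the `Comment` class, error message, and return shape are illustrative, not the project's actual implementation; in the real code the failure surfaces as a 500 Problem rather than a plain `Error`):

```javascript
// A stand-in frame that names its fields but defines no fieldTypes.
class Comment {}
Comment.fields = ['body', 'actorId'];

// insertMany refuses to run when the frame has no fieldTypes, since
// sql.unnest requires an explicit SQL type for every column.
const insertMany = (rows) => {
  const def = rows[0].constructor;
  if (def.fieldTypes == null)
    throw new Error(`${def.name} does not define fieldTypes, required for bulk insert`);
  // ...build and run the sql.unnest query here...
  return { columns: def.fields, types: def.fieldTypes, count: rows.length };
};

try {
  insertMany([new Comment()]);
} catch (err) {
  console.log(err.message);
}
```

Failing loudly at the call site is what keeps the missing-type case from producing a confusing query error deep inside slonik.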
To summarize:

- `insertMany` changed to use `sql.unnest` instead of `sql.join`.
- Frames that use `insertMany` must now also specify a `fieldTypes` array for the properties in that frame.
Another approach to this could be using […]; I think this could allow for a query with a single placeholder.
Closes #707
What has been done to verify that this works as intended?
Unit tests, integration tests, and manual testing with the form with 25K questions.
Why is this the best possible solution? Were any other approaches considered?
PostgreSQL has a limit on the number of parameters that can be passed in a query. To insert a large number of rows, it provides the `unnest` function. The alternative is to break the rows into batches, which is not performant and would require a lot of code changes.

How does this change affect users? Describe intentional changes to behavior and behavior that could have accidentally been affected by code changes. In other words, what are the regression risks?
Test cases related to multiselect questions, client audits, form attachments, and submission attachments should be executed.
Does this change require updates to the API documentation? If so, please update docs/api.md as part of this PR.
No.
Before submitting this PR, please make sure you have:
- run `make test-full` and confirmed all checks still pass OR confirmed the CircleCI build passes