Support larger file uploads to S3 stores via direct upload using presigned URLs #6489

Closed
qqmyers opened this issue Jan 7, 2020 · 0 comments · Fixed by #6490

qqmyers commented Jan 7, 2020

One of the bottlenecks to handling large files (e.g. 1-100GB+) in Dataverse is that the existing implementation streams them through the Glassfish server and creates a local temporary file.

Working for TDL, I've developed a mechanism to upload files directly to S3 stores using presigned URLs. The overall concept is analogous to how the direct S3 download works: Dataverse creates a presigned URL allowing download of a file and sends it to the client's browser, which redirects to that URL and retrieves the file content directly from the S3 store. To support upload, I added an API call to request a presigned direct-upload URL and then adapted the DVUploader to use it. I've also used JavaScript to catch upload requests in the Dataverse web interface and, for a store that supports direct upload, transfer the file directly via Ajax calls. Finally, I extended the existing upload methods to accept the storageId assigned to the file in place of the file content stream itself.
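
For concreteness, here is a minimal sketch of the server-side step described above: generating a presigned PUT URL with the AWS SDK for Java v1 that the client can then use to send the file bytes straight to S3. This is not the actual Dataverse/PR code; the class name, bucket/key handling, and expiration are illustrative assumptions.

```java
import com.amazonaws.HttpMethod;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GeneratePresignedUrlRequest;

import java.net.URL;
import java.util.Date;

public class PresignedUploadUrlSketch {

    // Presign a PUT to the object key the file will be stored under.
    // Bucket name, key layout, and TTL are assumptions for this sketch.
    public static URL createUploadUrl(String bucket, String storageId, long ttlMillis) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        GeneratePresignedUrlRequest request =
                new GeneratePresignedUrlRequest(bucket, storageId)
                        .withMethod(HttpMethod.PUT)
                        .withExpiration(new Date(System.currentTimeMillis() + ttlMillis));

        // The returned URL can be handed to the DVUploader or to browser-side
        // JavaScript, which uploads the file content directly to S3 with it.
        return s3.generatePresignedUrl(request);
    }
}
```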

In the initial implementation for TDL, I disabled any ingest processing that requires access to the file contents (i.e., unzipping, format conversion, metadata extraction, MIME-type determination). Per discussion on the DV Community call, my intent is to restore all of that capability, perhaps up to a file-size limit and perhaps excluding the unzip step. (The primary concern is that the benefit of streaming larger files directly to S3 would be negated if Dataverse then retrieves the file, unzips its contents, etc. A size limit would be a simple way to address this concern to start.)
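
As an illustration of the size-limit idea only, a gate like the following could decide whether the content-based ingest steps run at all. The threshold, class, and method names are made up for this sketch and are not Dataverse APIs.

```java
public class IngestSizeGateSketch {

    // Illustrative threshold; in practice this would come from configuration.
    private static final long INGEST_SIZE_LIMIT_BYTES = 2L * 1024 * 1024 * 1024; // 2 GB

    // Skip unzip, format conversion, metadata extraction, etc. for files large
    // enough that pulling them back out of S3 would negate the benefit of the
    // direct upload.
    public static boolean shouldRunContentIngest(long fileSizeBytes) {
        return fileSizeBytes <= INGEST_SIZE_LIMIT_BYTES;
    }
}
```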

TDL is currently testing this code, and we've already uploaded a 39 GB file this way. We have not yet compared performance against the existing upload path (for smaller files, where both should work) but hope to do so. The current web-interface implementation streams the file to S3 and then calculates an MD5 hash locally on the user's machine before triggering Dataverse to update the dataset, so the overall performance should nominally be the raw performance of the S3 upload plus the time to compute the MD5 hash on the local machine. (As with uploads through Glassfish, multiple files upload in parallel in the web interface, helping to maximize use of available bandwidth.)
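
For illustration, here is a minimal sketch of what a command-line client such as the DVUploader might do with a presigned URL, assuming the URL has already been obtained from Dataverse: stream the file to S3, then compute the MD5 checksum locally so it can be reported back when the dataset is updated. This is an assumption-laden sketch, not the actual DVUploader code; error handling and the final "register the file with Dataverse" call are omitted.

```java
import java.io.InputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

public class DirectUploadClientSketch {

    public static void main(String[] args) throws Exception {
        Path file = Path.of(args[0]);
        URI presignedPutUrl = URI.create(args[1]);

        // 1. PUT the file bytes straight to S3 via the presigned URL.
        HttpClient http = HttpClient.newHttpClient();
        HttpRequest put = HttpRequest.newBuilder(presignedPutUrl)
                .PUT(HttpRequest.BodyPublishers.ofFile(file))
                .build();
        HttpResponse<Void> response = http.send(put, HttpResponse.BodyHandlers.discarding());
        System.out.println("S3 responded with status " + response.statusCode());

        // 2. Compute the MD5 hash locally (the web interface does the same in JavaScript).
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                md5.update(buffer, 0, read);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest()) {
            hex.append(String.format("%02x", b));
        }
        System.out.println("MD5: " + hex);

        // 3. The storage identifier and checksum would then be sent to Dataverse
        //    to add the file to the dataset (endpoint details omitted here).
    }
}
```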

One additional benefit of this approach is that it is essentially transparent to the user: there is no visible change to the file upload process in the Dataverse web interface.

I'll be creating a PR, possibly building off of #6488, which supports multiple stores. Whether it makes sense to merge this in its current form is TBD, but the approach looks promising enough that I wanted to get it on the radar.
