Support larger file uploads to S3 stores via direct upload using presigned URLs #6489

Closed
qqmyers opened this issue Jan 7, 2020 · 0 comments · Fixed by #6490

qqmyers commented Jan 7, 2020

One of the bottlenecks to handling large files (e.g. 1-100GB+) in Dataverse is that the existing implementation streams them through the Glassfish server and creates a local temporary file.

Working for TDL, I've developed a mechanism to upload files directly to S3 stores using presigned URLs. The overall concept is analogous to how the direct S3 download works: Dataverse creates a presigned URL allowing download of a file and sends it to the client's browser, which redirects to that URL and retrieves the file content directly from the S3 store. To support upload, I added an API call to request a presigned direct-upload URL and then adapted the DVUploader to use it. I've also used JavaScript to catch upload requests in the Dataverse web interface and, for a store that supports direct upload, transfer the file directly via Ajax calls. Finally, I extended the existing upload methods to accept the storageId assigned to the file in place of the file content stream itself.
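
For concreteness, here is a minimal sketch of the server-side step described above: generating a presigned PUT URL with the AWS SDK for Java v1 that the client can then use to send the file bytes straight to S3. This is not the actual Dataverse/PR code; the class name, bucket/key handling, and expiration are illustrative assumptions.

```java
import com.amazonaws.HttpMethod;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GeneratePresignedUrlRequest;

import java.net.URL;
import java.util.Date;

public class PresignedUploadUrlSketch {

    // Presign a PUT to the object key the file will be stored under.
    // Bucket name, key layout, and TTL are assumptions for this sketch.
    public static URL createUploadUrl(String bucket, String storageId, long ttlMillis) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        GeneratePresignedUrlRequest request =
                new GeneratePresignedUrlRequest(bucket, storageId)
                        .withMethod(HttpMethod.PUT)
                        .withExpiration(new Date(System.currentTimeMillis() + ttlMillis));

        // The returned URL can be handed to the DVUploader or to browser-side
        // JavaScript, which uploads the file content directly to S3 with it.
        return s3.generatePresignedUrl(request);
    }
}
```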

In the initial implementation for TDL, I disabled any ingest processing that requires access to the file contents (i.e., unzipping, format conversion, metadata extraction, MIME-type determination). Per discussion on the DV Community call, my intent is to restore all of that capability, perhaps up to a file-size limit and perhaps excluding the unzip step. (The primary concern is that the benefit of streaming larger files directly to S3 would be negated if Dataverse then retrieves the file, unzips its contents, etc. A size limit would be a simple way to address this concern to start.)
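
As an illustration of the size-limit idea only, a gate like the following could decide whether the content-based ingest steps run at all. The threshold, class, and method names are made up for this sketch and are not Dataverse APIs.

```java
public class IngestSizeGateSketch {

    // Illustrative threshold; in practice this would come from configuration.
    private static final long INGEST_SIZE_LIMIT_BYTES = 2L * 1024 * 1024 * 1024; // 2 GB

    // Skip unzip, format conversion, metadata extraction, etc. for files large
    // enough that pulling them back out of S3 would negate the benefit of the
    // direct upload.
    public static boolean shouldRunContentIngest(long fileSizeBytes) {
        return fileSizeBytes <= INGEST_SIZE_LIMIT_BYTES;
    }
}
```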

TDL is currently testing this code, and we've already uploaded a 39 GB file this way. We have not yet compared performance against the existing upload path (for smaller files, where both should work) but hope to do so. The current web-interface implementation streams the file to S3 and then calculates an MD5 hash locally on the user's machine before triggering Dataverse to update the dataset, so the overall performance should nominally be the raw performance of the S3 upload plus the time to compute the MD5 hash on the local machine. (As with uploads through Glassfish, multiple files upload in parallel in the web interface, helping to maximize use of available bandwidth.)
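
For illustration, here is a minimal sketch of what a command-line client such as the DVUploader might do with a presigned URL, assuming the URL has already been obtained from Dataverse: stream the file to S3, then compute the MD5 checksum locally so it can be reported back when the dataset is updated. This is an assumption-laden sketch, not the actual DVUploader code; error handling and the final "register the file with Dataverse" call are omitted.

```java
import java.io.InputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

public class DirectUploadClientSketch {

    public static void main(String[] args) throws Exception {
        Path file = Path.of(args[0]);
        URI presignedPutUrl = URI.create(args[1]);

        // 1. PUT the file bytes straight to S3 via the presigned URL.
        HttpClient http = HttpClient.newHttpClient();
        HttpRequest put = HttpRequest.newBuilder(presignedPutUrl)
                .PUT(HttpRequest.BodyPublishers.ofFile(file))
                .build();
        HttpResponse<Void> response = http.send(put, HttpResponse.BodyHandlers.discarding());
        System.out.println("S3 responded with status " + response.statusCode());

        // 2. Compute the MD5 hash locally (the web interface does the same in JavaScript).
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                md5.update(buffer, 0, read);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest()) {
            hex.append(String.format("%02x", b));
        }
        System.out.println("MD5: " + hex);

        // 3. The storage identifier and checksum would then be sent to Dataverse
        //    to add the file to the dataset (endpoint details omitted here).
    }
}
```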

One additional benefit of this approach is that it is essentially transparent to the user: there is no visible change to the file upload process in the Dataverse web interface.

I'll be creating a PR, possibly building off of #6488, which supports multiple stores. Whether it makes sense to merge this in its current form is TBD, but the approach looks promising enough that I wanted to get it on the radar.
