
Feature Request: Restrict API to require direct uploads #10828

Open
jarulsamy opened this issue Sep 9, 2024 · 2 comments
Labels
Type: Feature a feature request

Comments

@jarulsamy

Overview of the Feature Request

The ability to restrict API uploads so that direct upload is required for some Dataverse collections.

(Apologies if I somehow missed this already existing; I could not find docs for it.)

What kind of user is the feature intended for?
(Example users roles: API User, Curator, Depositor, Guest, Superuser, Sysadmin)

  • API users

What inspired the request?

By default, if a Dataverse installation is using an S3 storage backend with upload-redirect=true, uploads via the web UI automatically use direct upload. However, when a user uses the file upload APIs, they can circumvent direct upload and upload through the Dataverse instance instead. Besides being slower, this can cause problems if the user is uploading extremely large datasets that may fill the temp space on the Dataverse instance.
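For context, a minimal sketch of the relevant configuration, assuming a store with id `s3` (the store id and values here are placeholders; `upload-redirect` is the documented per-store JVM option that enables direct upload in the web UI):

```shell
# Enable redirected (direct-to-S3) uploads for an S3 store with id "s3".
# With this set, the web UI obtains pre-signed URLs so file bytes go
# straight to the bucket. Note: the native file upload API can still
# stream files through the Dataverse server itself, which is the
# behavior this feature request wants to be able to forbid.
./asadmin create-jvm-options "-Ddataverse.files.s3.upload-redirect=true"
```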

What existing behavior do you want changed?

I would like the addition of a per-Dataverse setting that can optionally require direct upload for API file uploads. Ideally, this would present the user with an error immediately if they try to upload through the server using the native API.

Currently, on our instance, users can attempt to upload a large file with the API, only to have it fail partway through with an internal server error. The admins then see an error like this in the logs:

Failed to save the upload as a temp file (temp disk space?)]]

By adding this feature, we could catch this issue before the user spends time and bandwidth partially uploading their dataset.

Alternatively, if there is some other feature or method to prevent this scenario from happening, I'm more than happy to explore that too. Currently, we are relying on user education and pushing them to use the direct upload APIs whenever possible for datasets over 8GB.
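As a rough illustration of what we push users toward, here is a hedged sketch of the first step of the direct upload API flow (server URL, token, DOI, and size are placeholders; the exact request shape should be checked against the Dataverse API guide's "Direct DataFile Upload" section):

```shell
# Step 1: ask Dataverse for pre-signed S3 upload URL(s) sized for the file,
# so the bytes never pass through the Dataverse server's temp space.
API_TOKEN="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"   # placeholder
SERVER_URL="https://demo.dataverse.org"            # placeholder
PERSISTENT_ID="doi:10.5072/FK2/EXAMPLE"            # placeholder dataset DOI

curl -H "X-Dataverse-key:$API_TOKEN" \
  "$SERVER_URL/api/datasets/:persistentId/uploadurls?persistentId=$PERSISTENT_ID&size=8000000000"

# Step 2 (not shown): PUT the file bytes to the returned URL(s), then
# register the file with the dataset using the returned storageIdentifier.
```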

Any open or closed issues related to this feature request?

Not that I could find.

Are you thinking about creating a pull request for this feature?

Ideally I would like to; however, it may not be in the near future (3+ months, probably).

@jarulsamy jarulsamy added the Type: Feature a feature request label Sep 9, 2024
@qqmyers
Member

qqmyers commented Sep 9, 2024

I think this makes sense - I don't think we had any explicit reason to still allow uploads through the Dataverse server. In fact, I don't know that we need a new configuration option; perhaps we could just disable normal upload when direct upload is true. (If anyone really needs to do a normal upload, they could configure two stores pointing to the same bucket, one with direct upload true and the other false, and flip to using the direct=false one if/when needed.)
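The two-store workaround above could be sketched like this (store ids `s3direct`/`s3proxy` and the bucket name are hypothetical; the per-store option names follow the `dataverse.files.<id>.*` pattern from the installation guide):

```shell
# Two S3 stores backed by the same bucket, differing only in
# upload-redirect. Normally datasets use "s3direct"; an admin can
# temporarily switch a collection to "s3proxy" if a streamed
# (through-the-server) upload is ever genuinely needed.
./asadmin create-jvm-options "-Ddataverse.files.s3direct.type=s3"
./asadmin create-jvm-options "-Ddataverse.files.s3direct.bucket-name=my-bucket"
./asadmin create-jvm-options "-Ddataverse.files.s3direct.upload-redirect=true"

./asadmin create-jvm-options "-Ddataverse.files.s3proxy.type=s3"
./asadmin create-jvm-options "-Ddataverse.files.s3proxy.bucket-name=my-bucket"
./asadmin create-jvm-options "-Ddataverse.files.s3proxy.upload-redirect=false"
```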

FWIW, the latest versions of DVUploader flip to making direct upload the default (I'm not sure about pyDVUploader). There is also the dvwebloader that can be added to Dataverse - it is direct upload only. These don't stop using the API to upload through the server, but they are ways to guide users away from doing that.

@pdurbin
Member

pdurbin commented Sep 9, 2024

@shlake and I talked about this, sort of, on Zulip.
