Improved handling of Globus uploads (experimental async framework) #10781

Merged
pdurbin merged 36 commits into develop from 10623-globus-improvements on Sep 25, 2024

Conversation

landreev
Contributor

@landreev landreev commented Aug 19, 2024

What this PR does / why we need it:

It solves the 2 problems described in the linked issue:

  • An API call to /addFiles via curl has been replaced with a call to the corresponding method in AddReplaceFileHelper. In addition to the problems with the curl implementation already described in the issue, the old code made no attempt to parse the output of the API; so if /addFiles dropped one or more files from the submitted entries, for whatever reason, the Globus service method still registered that as a complete success. The new code checks the results more carefully.
  • A new implementation of the upload transfer polling solves the problem where the Dataverse instance had to stay up, with the same service method looping continuously, for the duration of the transfer. The new framework relies on scheduled execution of the polling calls and saves the state of the ongoing tasks in the database. As an extra advantage, this gives an admin a relatively easy way to add the files to the user's dataset after the fact in a scenario where the remote Globus transfer succeeded but the files failed to be added to the dataset automatically, for whatever reason. However, due to the experimental nature of this new framework, I'm leaving it disabled by default, pending more evidence of how it holds up in a production-like environment.
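
The stricter result check described in the first bullet can be sketched roughly as follows. This is a hypothetical illustration, not the code in this PR: `UploadResultCheck`, `Result`, `Outcome`, `verify`, and the storage identifiers are all invented names. The only point it makes is that the set of files actually added is compared with the set submitted, instead of treating any response as a complete success.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: none of these names come from the Dataverse code base.
class UploadResultCheck {

    enum Outcome { SUCCESS, PARTIAL, FAILURE }

    /** Summary of one add-files attempt. */
    static final class Result {
        final Outcome outcome;
        final List<String> missing;

        Result(Outcome outcome, List<String> missing) {
            this.outcome = outcome;
            this.missing = missing;
        }
    }

    /** Compares what was submitted with what was actually added. */
    static Result verify(List<String> submitted, Set<String> added) {
        List<String> missing = new ArrayList<>();
        for (String id : submitted) {
            if (!added.contains(id)) {
                missing.add(id);
            }
        }
        Outcome outcome;
        if (missing.isEmpty()) {
            outcome = Outcome.SUCCESS;
        } else if (missing.size() == submitted.size()) {
            outcome = Outcome.FAILURE;   // nothing was added at all
        } else {
            outcome = Outcome.PARTIAL;   // some files were silently dropped
        }
        return new Result(outcome, missing);
    }

    public static void main(String[] args) {
        Result r = verify(List.of("s3://bucket/f1", "s3://bucket/f2"),
                          Set.of("s3://bucket/f1"));
        System.out.println(r.outcome + ", missing: " + r.missing);
    }
}
```

Distinguishing a partial result from a total failure is what would let a caller send a different notification in each case.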

The handling of diagnostics and notifications has generally been improved in the PR. Among other things, explicit failure notifications are now sent to users when uploads fail completely. The existing implementation only logs such events without notifying the user.

Which issue(s) this PR closes:

Special notes for your reviewer:

Note that I don't use the EJB timers for the scheduled execution of the task polling. I used the jakarta.enterprise.concurrent.ManagedScheduledExecutorService instead, which, from what I understand, is the preferred approach now.
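
For readers unfamiliar with this style of scheduling, here is a minimal sketch of the polling pattern. It is an assumption-laden stand-in, not the PR's implementation: it uses the plain `java.util.concurrent.ScheduledExecutorService` (which `ManagedScheduledExecutorService` extends) so that it runs outside a Jakarta EE container, an in-memory map in place of the database table, and a hypothetical `statusLookup` function in place of the remote Globus status call. `TransferTaskMonitor`, `register`, and `pollOnce` are invented names.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Hypothetical sketch of scheduled polling over saved task state.
class TransferTaskMonitor {

    enum Status { ACTIVE, SUCCEEDED, FAILED }

    // Stand-in for the database table of in-progress tasks (taskId -> datasetId).
    private final Map<String, String> pendingTasks = new ConcurrentHashMap<>();
    // Finished tasks and their final status, kept here only for inspection.
    final Map<String, Status> finished = new ConcurrentHashMap<>();
    // Hypothetical lookup of the remote transfer status for a task id.
    private final Function<String, Status> statusLookup;

    TransferTaskMonitor(Function<String, Status> statusLookup) {
        this.statusLookup = statusLookup;
    }

    void register(String taskId, String datasetId) {
        pendingTasks.put(taskId, datasetId);
    }

    int pendingCount() {
        return pendingTasks.size();
    }

    /** One polling pass: check every saved task and retire the finished ones. */
    void pollOnce() {
        for (String taskId : pendingTasks.keySet()) {
            Status status = statusLookup.apply(taskId);
            if (status != Status.ACTIVE) {
                finished.put(taskId, status);
                pendingTasks.remove(taskId);
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        TransferTaskMonitor monitor = new TransferTaskMonitor(id -> Status.SUCCEEDED);
        monitor.register("task-1", "dataset-42");
        // Each pass is short; no thread loops for the duration of the transfer.
        scheduler.scheduleWithFixedDelay(monitor::pollOnce, 0, 100, TimeUnit.MILLISECONDS);
        Thread.sleep(500);
        scheduler.shutdownNow();
        System.out.println("pending after polling: " + monitor.pendingCount());
    }
}
```

Because each pass reads the saved task list afresh, a restart between passes would lose nothing if that list lived in a database rather than in memory, which is the property the description above relies on.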

Suggestions on how to test this:

In order to test this PR, and other Globus-related things going forward, I added a Globus setup to dataverse-internal yesterday (Globus-enabled storage configuration plus the upload/download web app). This volume:
[Screenshot: Screen Shot 2024-09-05 at 4 11 45 PM]

Is connected to remote storage at NESE that was set up for us specifically for testing uploads and downloads. This setup is identical to what was previously configured on demo.dataverse.org. I haven't tested it much myself though.

The instructions on how to upload data via Globus in Dataverse can be found here: https://github.com/IQSS/dataverse.harvard.edu/blob/master/doc/globus/upload.md.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:

Preview at https://dataverse-guide--10781.org.readthedocs.build/en/10781/installation/config.html#dataverse-globus-taskmonitoringserver

* externally. (?)
*/
@NamedQueries({
@NamedQuery( name="ExternalFileUploadInProgress.deleteByTaskId",

🚫 [reviewdog] <com.puppycrawl.tools.checkstyle.checks.whitespace.FileTabCharacterCheck> reported by reviewdog 🐶
File contains tab characters (this is the first instance).

@coveralls

coveralls commented Aug 19, 2024

Coverage Status

coverage: 20.681% (-0.05%) from 20.734%
when pulling 5dc386f on 10623-globus-improvements
into 8f3fc4a on develop.

@landreev
Contributor Author

I'm not sure what's going on with Jenkins/integration tests. But I'll figure that out separately. I'm going to go ahead and un-draft the PR...

@landreev landreev marked this pull request as ready for review August 20, 2024 14:51

@qqmyers qqmyers self-requested a review August 20, 2024 15:04
@qqmyers qqmyers self-assigned this Aug 20, 2024

Member

@qqmyers qqmyers left a comment

Nice! Looks good to me. I made some comments/responded to notes in the code, but other than any open @todos and some minor cleanup, it looks ready for QA to me. Goodbye to exec-ing curl!

I'll put request changes to avoid it switching columns until @landreev is ready, but probably no need for another review of changes.

src/main/java/edu/harvard/iq/dataverse/api/Datasets.java (outdated)
@@ -4034,6 +4035,8 @@ public Response addGlobusFilesToDataset(@Context ContainerRequestContext crc,
return wr.getResponse();
}

// @todo check if the dataset is already locked!
Member

Makes sense - could there be a constraint when trying to create a lock, e.g. only one lock of a given time per dataset rather than a separate check?

Contributor Author

You meant "of a given type", right? The addDatasetLock() already has a check for an existing lock of the same type (then it will just return the existing lock instead of creating a new one).
As for implementing this as a hard constraint - are we positive we'll never want multiple locks of the same type - some workflows maybe?
I addressed this specific situation with a separate lock check. We don't check for locks consistently in our APIs, presumably on the assumption that it will be done when the relevant commands are executed. But in this specific case, there is a potential for doing a huge amount of work before that UpdateDatasetVersionCommand is called in the end.
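
A rough sketch of the two behaviours discussed in this thread, under stated assumptions: the `Dataset`, `LockType`, `addLock`, and `startUpload` names below are hypothetical and greatly simplified, not the actual Dataverse classes. It shows a lock set that holds at most one lock per type, and an early check that rejects a request before any expensive work begins.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch; not the real Dataverse Dataset or DatasetLock classes.
class EarlyLockCheck {

    enum LockType { GLOBUS_UPLOAD, INGEST, WORKFLOW }

    static class Dataset {
        private final Set<LockType> locks = EnumSet.noneOf(LockType.class);

        boolean isLocked() {
            return !locks.isEmpty();
        }

        /** Returns true if a new lock was added, false if one of this type already existed. */
        boolean addLock(LockType type) {
            return locks.add(type);
        }
    }

    /** Rejects the request up front instead of failing after the heavy work is done. */
    static String startUpload(Dataset dataset) {
        if (dataset.isLocked()) {
            return "rejected: dataset is locked";
        }
        dataset.addLock(LockType.GLOBUS_UPLOAD);
        // ... long-running transfer monitoring and file registration would start here ...
        return "accepted";
    }

    public static void main(String[] args) {
        Dataset dataset = new Dataset();
        System.out.println(startUpload(dataset));
        System.out.println(startUpload(dataset));
    }
}
```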

} catch (UnknownHostException ex) {
Logger.getLogger(DataverseTimerServiceBean.class.getName()).log(Level.SEVERE, null, ex);
}

if (timer.getInfo() instanceof MotherTimerInfo) {
- logger.info("Behold! I am the Master Timer, king of all timers! I'm here to create all the lesser timers!");
+ logger.fine("Behold! I am the Master Timer, king of all timers! I'm here to create all the lesser timers!");
Member

I'm going to miss these :-(

Contributor Author

You can still have it - at a low price of one FINE logging setting.

Member

Ha! It's like an old friend at this point. 😄
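
For anyone who does want the message back, the "one FINE logging setting" mentioned in this thread could look like the following in plain `java.util.logging`. This is only a sketch: the logger name `demo.timer` is a placeholder, and an application server would normally set log levels through its own configuration rather than in code.

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

// Sketch using plain java.util.logging; the logger name is a placeholder.
class FineLoggingDemo {

    static Logger configure(String name) {
        Logger logger = Logger.getLogger(name);
        logger.setLevel(Level.FINE);           // let FINE records through the logger
        ConsoleHandler handler = new ConsoleHandler();
        handler.setLevel(Level.FINE);          // and through the handler, which defaults to INFO
        logger.addHandler(handler);
        return logger;
    }

    public static void main(String[] args) {
        Logger logger = configure("demo.timer");
        logger.fine("Behold! I am the Master Timer, king of all timers!");
    }
}
```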

src/main/java/propertyFiles/Bundle.properties (outdated)

@landreev landreev added the Size: 30 A percentage of a sprint. 21 hours. (formerly size:33) label Sep 5, 2024
@landreev
Contributor Author

landreev commented Sep 5, 2024

I resolved the merge conflicts yesterday. Also added some info under "How to Test".

Resolved conflicts:
	src/main/java/edu/harvard/iq/dataverse/api/Datasets.java
(#10623)
@landreev
Contributor Author

Resolved another merge conflict.

@cmbz cmbz added the FY25 Sprint 6 FY25 Sprint 6 label Sep 11, 2024
@pdurbin pdurbin self-assigned this Sep 23, 2024
Member

@qqmyers qqmyers left a comment

Looks like my comments have been addressed. I still see the review dog complaint about tab chars in ExternalFileUploadInProgress - can that be fixed so we avoid the style fail?

@pdurbin
Member

pdurbin commented Sep 23, 2024

@qqmyers good catch. I put in a commit to quiet down that yappy dog: 7b6f81e

I also made a small doc tweak: 2baf62e

Then I tested on internal. With the new flag off I was able to upload a file. ✅

With the flag on, Globus is telling me "transfer complete" but the file is not in the dataset and the dataset has the "Globus Transfer in Progress" lock. ❌

@landreev is going to take a look.

Conflicts:
doc/sphinx-guides/source/installation/config.rst

@pdurbin pdurbin changed the title Improved handling of Globus uploads Improved handling of Globus uploads (experimental async framework) Sep 25, 2024

📦 Pushed preview images as

ghcr.io/gdcc/dataverse:10623-globus-improvements
ghcr.io/gdcc/configbaker:10623-globus-improvements

🚢 See on GHCR. Use by referencing with full name as printed above, mind the registry name.

@pdurbin
Member

pdurbin commented Sep 25, 2024

Turned out to be a doc problem, which is fixed now. Works great! Especially for an experimental feature. 😜

Also, API tests are passing as of the last code change: https://jenkins.dataverse.org/job/IQSS-Dataverse-Develop-PR/job/PR-10781/7/testReport/

Merging.

@pdurbin pdurbin merged commit d40ce32 into develop Sep 25, 2024
13 of 14 checks passed
@pdurbin pdurbin deleted the 10623-globus-improvements branch September 25, 2024 16:02
@pdurbin pdurbin added this to the 6.4 milestone Sep 25, 2024
Labels
FY25 Sprint 4, FY25 Sprint 5, FY25 Sprint 6, GREI 5 Use Cases, Size: 30 (a percentage of a sprint; 21 hours; formerly size:33)
Projects
Status: Merged 🚀
Development

Successfully merging this pull request may close these issues.

Globus integration: further improvements/refactoring
6 participants