TDL/7493 Batch Archiving #8610

Merged

Commits (28)
bf7b558
Merge and revert BagGenerator/Duracloud archiver changes in other PRs
qqmyers Apr 13, 2022
bc63cf8
Merge remote-tracking branch 'IQSS/develop' into
qqmyers May 24, 2022
b9d3fd1
Merge remote-tracking branch 'IQSS/develop' into TDL/7493-batch_archi…
qqmyers May 26, 2022
9a90fe1
Merge remote-tracking branch 'IQSS/develop' into TDL/7493-batch_archi…
qqmyers May 27, 2022
059bdfe
Merge remote-tracking branch 'IQSS/develop' into TDL/7493-batch_archi…
qqmyers Jun 26, 2022
267297c
Merge remote-tracking branch 'IQSS/develop' into TDL/7493-batch_archi…
qqmyers Jul 15, 2022
4c0d891
restore batch command
qqmyers Jul 15, 2022
6ea4878
drop line removal
qqmyers Jul 15, 2022
5f4d965
drop superuser req as this is admin and command already requires perm
qqmyers Jul 15, 2022
2357dd2
add doc for batch archiving command
qqmyers Jul 15, 2022
e2bb433
clarify archival bag language
qqmyers Jul 15, 2022
df05c0e
Merge remote-tracking branch 'IQSS/develop' into TDL/7493-batch_archi…
qqmyers Jul 21, 2022
5d3f6f5
Change to named query
qqmyers Jul 21, 2022
5fe18a7
fix toDos re: createDataverseRequest()
qqmyers Jul 21, 2022
86162de
remove call to set session user
qqmyers Jul 21, 2022
408a51f
Merge remote-tracking branch 'IQSS/develop' into
qqmyers Jul 21, 2022
ccb8653
use class
qqmyers Jul 25, 2022
4484c61
Merge remote-tracking branch 'IQSS/develop' into TDL/7493-batch_archi…
qqmyers Jul 25, 2022
17baefc
Add comment per review
qqmyers Jul 25, 2022
82d9e45
update/remove obsolete comments
qqmyers Jul 26, 2022
45ad628
Merge remote-tracking branch 'IQSS/develop' into TDL/7493-batch_archi…
qqmyers Jul 28, 2022
abd3923
updates for archival status (missed/lost)
qqmyers Jul 29, 2022
53dc116
typo
qqmyers Jul 29, 2022
e4a228d
Merge remote-tracking branch 'IQSS/develop' into TDL/7493-batch_archi…
qqmyers Aug 3, 2022
c918e64
Merge remote-tracking branch 'IQSS/develop' into TDL/7493-batch_archi…
qqmyers Aug 4, 2022
2eed04b
don't archive harvested datasets
qqmyers Aug 5, 2022
bae7011
lower list-only logging to fine
qqmyers Aug 5, 2022
4215ec5
doc /api response changes per QA
qqmyers Aug 8, 2022
34 changes: 23 additions & 11 deletions doc/sphinx-guides/source/installation/config.rst
@@ -1065,7 +1065,9 @@ BagIt file handler configuration settings:
BagIt Export
------------

Your Dataverse installation may be configured to submit a copy of published Datasets, packaged as `Research Data Alliance conformant <https://www.rd-alliance.org/system/files/Research%20Data%20Repository%20Interoperability%20WG%20-%20Final%20Recommendations_reviewed_0.pdf>`_ zipped `BagIt <https://tools.ietf.org/html/draft-kunze-bagit-17>`_ bags to `Chronopolis <https://libraries.ucsd.edu/chronopolis/>`_ via `DuraCloud <https://duraspace.org/duracloud/>`_ or alternately to any folder on the local filesystem.
Your Dataverse installation may be configured to submit a copy of published Datasets, packaged as `Research Data Alliance conformant <https://www.rd-alliance.org/system/files/Research%20Data%20Repository%20Interoperability%20WG%20-%20Final%20Recommendations_reviewed_0.pdf>`_ zipped `BagIt <https://tools.ietf.org/html/draft-kunze-bagit-17>`_ archival Bags (sometimes called BagPacks) to `Chronopolis <https://libraries.ucsd.edu/chronopolis/>`_ via `DuraCloud <https://duraspace.org/duracloud/>`_ or alternately to any folder on the local filesystem.

These archival Bags include all of the files and metadata in a given dataset version and are sufficient to recreate the dataset, e.g. in a new Dataverse instance, or potentially in another RDA-conformant repository.

The Dataverse Software offers an internal archive workflow which may be configured as a PostPublication workflow via an admin API call to manually submit previously published Datasets and prior versions to a configured archive such as Chronopolis. The workflow creates a `JSON-LD <http://www.openarchives.org/ore/0.9/jsonld>`_ serialized `OAI-ORE <https://www.openarchives.org/ore/>`_ map file, which is also available as a metadata export format in the Dataverse Software web interface.

@@ -1076,7 +1078,7 @@ At present, the DPNSubmitToArchiveCommand, LocalSubmitToArchiveCommand, and Goog
Duracloud Configuration
+++++++++++++++++++++++

Also note that while the current Chronopolis implementation generates the bag and submits it to the archive's DuraCloud interface, the step to make a 'snapshot' of the space containing the Bag (and verify it's successful submission) are actions a curator must take in the DuraCloud interface.
Also note that while the current Chronopolis implementation generates the archival Bag and submits it to the archive's DuraCloud interface, the steps to make a 'snapshot' of the space containing the archival Bag and to verify its successful submission are actions a curator must take in the DuraCloud interface.

The minimal configuration to support an archiver integration involves adding a minimum of two Dataverse Software Keys and any required Payara jvm options. The example instructions here are specific to the DuraCloud Archiver\:
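
The concrete commands are in the collapsed part of this hunk; as a rough sketch (assuming the standard ``:ArchiverSettings`` database setting, a DuraCloudSubmitToArchiveCommand class in the same package as the commands shown below, and DuraCloud host/port/context setting names, all of which are assumptions here), the two settings could look like:

``curl http://localhost:8080/api/admin/settings/:ArchiverClassName -X PUT -d "edu.harvard.iq.dataverse.engine.command.impl.DuraCloudSubmitToArchiveCommand"``

``curl http://localhost:8080/api/admin/settings/:ArchiverSettings -X PUT -d ":DuraCloudHost, :DuraCloudPort, :DuraCloudContext"``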

@@ -1100,7 +1102,7 @@ It also can use one setting that is common to all Archivers: :BagGeneratorThread

``curl http://localhost:8080/api/admin/settings/:BagGeneratorThreads -X PUT -d '8'``

By default, the Bag generator zips two datafiles at a time when creating the Bag. This setting can be used to lower that to 1, i.e. to decrease system load, or to increase it, e.g. to 4 or 8, to speed processing of many small files.
By default, the Bag generator zips two datafiles at a time when creating the archival Bag. This setting can be used to lower that to 1, i.e. to decrease system load, or to increase it, e.g. to 4 or 8, to speed processing of many small files.

Archivers may require JVM options as well. For the Chronopolis archiver, the username and password associated with your organization's Chronopolis/DuraCloud account should be configured in Payara:
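
The concrete commands are in the collapsed hunk; a sketch of the idea, assuming the DuraCloud archiver reads ``duracloud.username`` and ``duracloud.password`` JVM options (the option names and placeholder values are assumptions here):

``./asadmin create-jvm-options '-Dduracloud.username=YOUR_USERNAME_HERE'``

``./asadmin create-jvm-options '-Dduracloud.password=YOUR_PASSWORD_HERE'``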

@@ -1117,7 +1119,7 @@ ArchiverClassName - the fully qualified class to be used for archiving. For exam

``curl -X PUT -d "edu.harvard.iq.dataverse.engine.command.impl.LocalSubmitToArchiveCommand" http://localhost:8080/api/admin/settings/:ArchiverClassName``

\:BagItLocalPath - the path to where you want to store BagIt. For example\:
\:BagItLocalPath - the path to where you want to store the archival Bags. For example\:

``curl -X PUT -d /home/path/to/storage http://localhost:8080/api/admin/settings/:BagItLocalPath``

@@ -1132,7 +1134,7 @@ ArchiverClassName - the fully qualified class to be used for archiving. For exam
Google Cloud Configuration
++++++++++++++++++++++++++

The Google Cloud Archiver can send Dataverse Project Bags to a bucket in Google's cloud, including those in the 'Coldline' storage class (cheaper, with slower access)
The Google Cloud Archiver can send archival Bags to a bucket in Google's cloud, including those in the 'Coldline' storage class (cheaper, with slower access).

``curl http://localhost:8080/api/admin/settings/:ArchiverClassName -X PUT -d "edu.harvard.iq.dataverse.engine.command.impl.GoogleCloudSubmitToArchiveCommand"``
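
The remaining Google Cloud settings are in the collapsed hunk; as a sketch (the ``:GoogleCloudBucket`` and ``:GoogleCloudProject`` setting names and the placeholder values are assumptions here):

``curl http://localhost:8080/api/admin/settings/:GoogleCloudBucket -X PUT -d "your-bucket-name"``

``curl http://localhost:8080/api/admin/settings/:GoogleCloudProject -X PUT -d "your-project-id"``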

@@ -1158,23 +1160,33 @@ For example:

.. _Archiving API Call:

API Call
++++++++
API Calls
+++++++++

Once this configuration is complete, you, as a user with the *PublishDataset* permission, should be able to use the API call to manually submit a DatasetVersion for processing:
Once this configuration is complete, you, as a user with the *PublishDataset* permission, should be able to use the admin API call to manually submit a DatasetVersion for processing:

``curl -H "X-Dataverse-key: <key>" http://localhost:8080/api/admin/submitDataVersionToArchive/{id}/{version}``
``curl -X POST -H "X-Dataverse-key: <key>" http://localhost:8080/api/admin/submitDatasetVersionToArchive/{id}/{version}``

where:

``{id}`` is the DatasetId (or ``:persistentId`` with the ``?persistentId="<DOI>"`` parameter), and

``{version}`` is the friendly version number, e.g. "1.2".
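
For example, using a persistent identifier (the DOI below is a hypothetical placeholder):

``curl -X POST -H "X-Dataverse-key: <key>" "http://localhost:8080/api/admin/submitDatasetVersionToArchive/:persistentId/1.2?persistentId=doi:10.5072/FK2/ABCDEF"``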

The submitDataVersionToArchive API (and the workflow discussed below) attempt to archive the dataset version via an archive specific method. For Chronopolis, a DuraCloud space named for the dataset (it's DOI with ':' and '.' replaced with '-') is created and two files are uploaded to it: a version-specific datacite.xml metadata file and a BagIt bag containing the data and an OAI-ORE map file. (The datacite.xml file, stored outside the Bag as well as inside is intended to aid in discovery while the ORE map file is 'complete', containing all user-entered metadata and is intended as an archival record.)
The submitDatasetVersionToArchive API (and the workflow discussed below) attempt to archive the dataset version via an archive-specific method. For Chronopolis, a DuraCloud space named for the dataset (its DOI with ':' and '.' replaced with '-') is created and two files are uploaded to it: a version-specific datacite.xml metadata file and a BagIt bag containing the data and an OAI-ORE map file. (The datacite.xml file, stored outside the Bag as well as inside, is intended to aid in discovery, while the ORE map file is 'complete', containing all user-entered metadata, and is intended as an archival record.)

In the Chronopolis case, since the transfer from the DuraCloud front-end to archival storage in Chronopolis can take significant time, it is currently up to the admin/curator to submit a 'snapshot' of the space within DuraCloud and to monitor its successful transfer. Once transfer is complete, the space should be deleted, at which point the Dataverse Software API call can be used to submit a Bag for other versions of the same Dataset. (The space is reused, so that archival copies of different Dataset versions correspond to different snapshots of the same DuraCloud space.)

A batch version of this admin API call is also available:

``curl -X POST -H "X-Dataverse-key: <key>" "http://localhost:8080/api/admin/archiveAllUnarchivedDatasetVersions?listonly=true&limit=10&latestonly=true"``

The archiveAllUnarchivedDatasetVersions call takes three optional query parameters (see the usage sketch after this list):

* ``listonly=true`` will cause the API to list the dataset versions that would be archived, but will not take any action.
* ``limit=<n>`` will limit the number of dataset versions archived in one API call to at most ``<n>``.
Contributor: I get how this works, but what's the reason to limit this way? (counting both successes and failures)

Member Author (qqmyers): listonly=true gives you a list, so with limit working this way you can make sure that only the things you listed will get processed when you drop listonly=true. Overall, the concern is about load, particularly if/when something is misconfigured and everything will fail after all the work to create a bag.

Contributor: Are these guaranteed to work with the list in the same order each time (i.e. if something is added, it would be added at the end, so limit is guaranteed to get the things from the last listAll)?

Member Author (qqmyers): ~yes - it's the new named query that is getting the list, so unless something affects the return order from that (which I think will go in id order by default), it wouldn't change.

* ``latestonly=true`` will limit archiving to only the latest published version of each dataset instead of archiving all unarchived versions.
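
For example, a typical sequence (a sketch; the API key and limit value are placeholders) is to list what would be archived first, then run the same call without ``listonly``:

``curl -X POST -H "X-Dataverse-key: <key>" "http://localhost:8080/api/admin/archiveAllUnarchivedDatasetVersions?listonly=true&limit=10&latestonly=true"``

``curl -X POST -H "X-Dataverse-key: <key>" "http://localhost:8080/api/admin/archiveAllUnarchivedDatasetVersions?limit=10&latestonly=true"``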


PostPublication Workflow
++++++++++++++++++++++++

@@ -2568,7 +2580,7 @@ Number of errors to display to the user when creating DataFiles from a file uplo
.. _:BagItHandlerEnabled:

:BagItHandlerEnabled
+++++++++++++++++++++
++++++++++++++++++++

Part of the database settings to configure the BagIt file handler. Enables the BagIt file handler. By default, the handler is disabled.
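
For example, to enable it (a sketch following the same settings-API pattern used elsewhere in this guide):

``curl http://localhost:8080/api/admin/settings/:BagItHandlerEnabled -X PUT -d 'true'``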

9 changes: 9 additions & 0 deletions src/main/java/edu/harvard/iq/dataverse/DatasetVersion.java
@@ -40,6 +40,8 @@
import javax.persistence.Index;
import javax.persistence.JoinColumn;
import javax.persistence.ManyToOne;
import javax.persistence.NamedQueries;
import javax.persistence.NamedQuery;
import javax.persistence.OneToMany;
import javax.persistence.OneToOne;
import javax.persistence.OrderBy;
@@ -60,6 +62,13 @@
*
* @author skraffmiller
*/

// Named query used by the batch archiving API to find published dataset versions
// that do not yet have an archival copy (archivalCopyLocation is null).
@NamedQueries({
    @NamedQuery(name = "DatasetVersion.findUnarchivedReleasedVersions",
            query = "SELECT OBJECT(o) FROM DatasetVersion AS o WHERE o.releaseTime IS NOT NULL and o.archivalCopyLocation IS NULL"
    )})


@Entity
@Table(indexes = {@Index(columnList="dataset_id")},
uniqueConstraints = @UniqueConstraint(columnNames = {"dataset_id,versionnumber,minorversionnumber"}))
Expand Down
src/main/java/edu/harvard/iq/dataverse/DatasetVersionServiceBean.java
@@ -1195,4 +1195,24 @@ private DatasetVersion getPreviousVersionWithUnf(DatasetVersion datasetVersion)
public DatasetVersion merge( DatasetVersion ver ) {
return em.merge(ver);
}

/**
 * Finds all released dataset versions that do not yet have an archival copy,
 * i.e. versions whose releaseTime is set and whose archivalCopyLocation is null,
 * using the DatasetVersion.findUnarchivedReleasedVersions named query.
 *
 * @return the matching DatasetVersions (an empty list if there are none), or null if the query fails
 */
public List<DatasetVersion> getUnarchivedDatasetVersions() {

    try {
        // getResultList() returns an empty list rather than throwing NoResultException when nothing matches
        return em.createNamedQuery("DatasetVersion.findUnarchivedReleasedVersions", DatasetVersion.class).getResultList();
    } catch (EJBException e) {
        logger.log(Level.WARNING, "EJBException exception: {0}", e.getMessage());
        return null;
    }
} // end getUnarchivedDatasetVersions
} // end class