
Data and Metadata Packaging for Archiving


Introduction:

QDR is planning to submit copies of published datasets, in packaged archival form, to the Digital Preservation Network.

After some initial review of Dataverse's existing metadata export formats, relevant standards and recommendations, the internal structure of archival packages in DPN, the ability to add new metadata terms to Dataverse and the plans for additional provenance...

Our discussion at QDR led us to look at the combination of an OAI-ORE metadata format (serialized in JSON-LD) and a zipped BagIt Bag as a potential format that would allow transfer of a single file per dataset to an archiving service (with the potential for multi-file or holey Bags as a means to support larger datasets). To ground our thinking and provide something concrete for discussion with the larger Dataverse community, we are exploring, on the feature/QDR-953 branch, a proof-of-concept implementation of an OAI-ORE map metadata export and the generation of a zipped BagIt Bag data/metadata export.

To focus on these implementations rather than create a new mechanism to invoke their creation, I've temporarily added both to the Metadata Export menu as additional 'metadata' exporters (not strictly accurate for the Bag, since it contains data as well). In doing so, I ended up implementing streaming output for exporters (as recommended in the code comments) to support the binary export of a zipped Bag; that change will probably become a separate PR to Dataverse.

Overall Plan:

QDR is interested in storing an archival copy of published datasets with DPN using their DuraCloud service. The intent is for this to be an automatic part of the publication process (versus a separate/optional step after publication or applicable to data in draft/review/other state). The archival copy would be for QDR's administrative use, i.e. community access to the dataset would continue to be through Dataverse and the archived copy would only be accessible to QDR staff and used as a means of validating and, if necessary, restoring the copy in Dataverse. To support this use, the archival copy must include fixity information and be metadata-complete (including all the information needed to recreate a dataset). QDR's incremental development plan involves:

  • Creating archival copies that can be manually submitted to DPN
  • Automating the submission of archival copies as part of the publication process
  • Providing status on the archival copy within Dataverse
  • Enabling validation of data and metadata between the published and archival copies (fixity-related validation for files and either fixity of the export format or content-based validation of metadata)

OAI-ORE and BagIt:

BagIt (specifically a complete, zipped Bag) and OAI-ORE (specifically the JSON-LD serialization) seem like a natural choice for creating an archival copy of a dataset. BagIt defines a folder structure for organizing data and metadata and requires a few files that provide minimal descriptive and fixity information. BagIt does not limit the format of data files and allows additional metadata, either within the required files or as separate files. OAI-ORE is, analogously, an extensible schema that addresses the basics of identifying an aggregation (e.g. a DV Dataset) and its aggregated resources (e.g. DV Datafiles) and of separating metadata about an aggregation from the aggregation itself. As a semantic format, ORE has the advantage that terms from other schemas can be added anywhere (versus an XML structure like METS, where extensions are limited to specific points in the hierarchical structure). Using JSON-LD for ORE helps improve readability and makes it easy to leverage the JSON libraries already included in DV for generating it.
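For concreteness, here is a minimal sketch of the declaration and metadata tag files mentioned above. The labels above each file are not part of the files themselves, and the values shown are hypothetical rather than taken from the proof-of-concept:

```
bagit.txt:

BagIt-Version: 1.0
Tag-File-Character-Encoding: UTF-8

bag-info.txt (a few of the reserved elements):

Source-Organization: Qualitative Data Repository
Bagging-Date: 2019-01-10
External-Description: Archival copy of a published Dataverse dataset
External-Identifier: doi:10.5072/FK2/EXAMPLE
Payload-Oxum: 12345678.3
```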

Mapping a Dataset to OAI-ORE

The basic mapping of a DV Dataset as an ORE aggregation that 'aggregates' DV Datafiles is relatively obvious. ORE then allows metadata to be recorded about the 'ORE map' (the document being generated to describe the aggregation and its contents), the aggregation/Dataset, and each file. Since the ORE map is being created on the fly, there is no stored metadata for it, and required/desired properties can be dynamically generated or hardcoded. So, for example, creation and modification dates can be set to 'now', the creator can be set to the institution running the Dataverse instance, and the license (just for use of the map) can be set to CC0 or similar.

The map 'describes' a Dataset that has all of the DV Dataset metadata. This ranges from basic information such as the title and creation date to metadata defined in .tsv files and entered by the user in the DV 'add/edit metadata' form. While most of this information is not required by ORE and could be expressed using any vocabulary, it is valuable to use a standard vocabulary such as Dublin Core or schema.org's Dataset terms. In our proof-of-concept, we've chosen schema.org terms for basic information. For information in tsv files, it may be best long term to allow tsv file creators to specify the URI to use for each term. In the meantime, I've chosen to generate terms in a custom vocabulary using a URI constructed from the Dataverse instance URI, the name of the tsv file, and the term name (and any parents), as with https://dv.dev-aws.qdr.org/schema/citation/publication#publicationCitation.

Data files have relatively limited metadata, including title, size, format, hash value, etc. Some of these are fairly DV specific, e.g. the storageidentifier and datasetVersionId. As shown in the proof-of-concept, all of these can be included in an exported OAI-ORE map. Published datasets from Dataverse have a ready-made identifier: the DOI or Handle created for the dataset. With the 4.9 release of DOIs for files, published files will also have DOI identifiers available. In the proof-of-concept, I created a globally unique URI identifier using the access API URL for the file, e.g. https://dv.dev-aws.qdr.org/api/access/datafile/93 . While a more opaque id would be desirable, this is a valid choice, and it is useful to know a URL from which the bytes of the file can be retrieved (as discussed in the next section, I've used these URLs to retrieve the files for the Bag, which requires only having the ORE map file). When DOIs for files are available, I plan to make the access URL available using a term indicating a 'distribution' of the datafile.
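As an illustration of this mapping, here is a heavily simplified JSON-LD sketch of an ORE map. The namespaces and term choices follow the discussion above, but the actual export on the feature/QDR-953 branch contains many more fields and may spell some terms differently; the dataset DOI, title, and file name shown are placeholders:

```json
{
  "@context": {
    "ore": "http://www.openarchives.org/ore/terms/",
    "dcterms": "http://purl.org/dc/terms/",
    "schema": "http://schema.org/",
    "citation": "https://dv.dev-aws.qdr.org/schema/citation/publication#"
  },
  "@type": "ore:ResourceMap",
  "dcterms:modified": "2019-01-10T12:00:00Z",
  "dcterms:creator": "Qualitative Data Repository",
  "schema:license": "https://creativecommons.org/publicdomain/zero/1.0/",
  "ore:describes": {
    "@id": "doi:10.5072/FK2/EXAMPLE",
    "@type": [ "ore:Aggregation", "schema:Dataset" ],
    "schema:name": "Example Dataset Title",
    "schema:dateModified": "2019-01-10",
    "citation:publicationCitation": "An example related-publication citation",
    "ore:aggregates": [
      {
        "@id": "https://dv.dev-aws.qdr.org/api/access/datafile/93",
        "schema:name": "interviews.csv",
        "schema:fileFormat": "text/csv",
        "schema:contentSize": 12345
      }
    ]
  }
}
```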

Mapping a Dataset to BagIt

The basic mapping of a Dataset into a BagIt structure simply involves placing the data files in the ../data directory and adding the required metadata files. Those include a bag-info.txt file with basic information about the title, description, source organization, bagging date, etc., and a manifest-<algorithm>.txt file that lists the hash of each file in the bag, using an algorithm such as md5. There are several other files, some defined in the BagIt spec and others recommended elsewhere, that can enhance the mapping.

First, including the ORE map as a metadata file, as done by the SEAD and DataONE DataNet projects and as in the RDA's recent repository interoperability recommendation, is a way to include 'all' metadata. One issue that has to be addressed in doing this is that the ORE map identifies files by ID, whereas bag files such as manifest-<algorithm>.txt refer to files via their path within the bag. In the proof-of-concept, I've followed the practice of DataONE (also adopted by SEAD) of adding a pid-mapping.txt file that maps IDs to paths, one entry per line, including an entry mapping the ID of the dataset to a folder name within the ../data directory. Alternative approaches are possible, such as including path info in the ORE map file, but since this is specific to Bag formatting rather than intrinsic metadata, a separate file seems to be a better approach.

One issue that has been addressed by SEAD (and by DataONE in a different way), and that is not yet important for Dataverse, is how to handle folder structure within a Dataset. Nominally, Bags can directly include such structure: data files are simply placed in folders mirroring the dataset structure. However, while the ORE standard allows an aggregation to aggregate another aggregation, it prohibits embedding information about the aggregated aggregation in the parent's ORE map file (each aggregation must have its own map file). DataONE has accepted this limitation and breaks single datasets into multiple ORE aggregations stored in separate Bags. SEAD, seeking to avoid breaking one logical dataset into multiple archival items, adopted a convention of treating a dataset with folder structure as a flat aggregation of resources representing files and folders and encoding the logical hierarchy using a different vocabulary (DC:hasPart). Thus a dataset is represented in an ORE map as a flat list of items, with the dataset and folders having metadata indicating their direct descendants. While this may seem like a kludge, it begins to make more sense if one considers that datasets may include additional relationships, such as file versions (linear or branched) and provenance, that may cross folder boundaries. In such cases, treating the folder hierarchy as more fundamental and potentially breaking datafiles that form a provenance chain into separate ORE aggregations seems more problematic. It was this perspective that led SEAD to develop its approach (which has worked in cases with 100K+ files, 12+ levels of folder hierarchy, and 600GB+ in total data volume).

In the proof-of-concept here, I chose to add a separate schema.org/Dataset hasPart relationship that, since current DV Datasets have no folders, simply mirrors the flat structure of the aggregation itself but opens the door for more structured datasets going forward.
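To make the resulting structure concrete, here is a sketch of what a zipped Bag for a single dataset might look like when unzipped. The directory and file names are illustrative rather than the exact names used in the proof-of-concept:

```
example-bag/
   bagit.txt
   bag-info.txt
   manifest-md5.txt              hash and bag-relative path for each payload file
   metadata/
      oremap.jsonld              the OAI-ORE map discussed above
      pid-mapping.txt            maps IDs to paths within the bag
   data/
      doi-10.5072-FK2-EXAMPLE/
         interviews.csv
         codebook.pdf
```

And a corresponding pid-mapping.txt sketch, one ID/path pair per line, with the dataset ID mapping to the folder under data/ (the exact path form used in the proof-of-concept may differ):

```
doi:10.5072/FK2/EXAMPLE data/doi-10.5072-FK2-EXAMPLE/
https://dv.dev-aws.qdr.org/api/access/datafile/93 data/doi-10.5072-FK2-EXAMPLE/interviews.csv
```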

Mapping Datasets to DPN

DPN has defined a profile/variant of BagIt for their use, which is used when packaging files uploaded through the DuraCloud user interface. Relevant features of DPN's bags (a combined sketch follows this list):

  • Earlier versions of the BagIt spec (e.g. 0.97) required a zipped Bag to be named using the identifier of the item being bagged and for that same name to be used as the top-level directory within the bag. DPN has chosen to use the internal DPN UUID for the base directory (versus, for example, an externally assigned DOI).
  • DPN has selected sha256 as the required algorithm for the manifest (manifest-sha256.txt). BagIt 1.0 recommends sha512 but allows sha256.
  • DPN has decided to require the optional tagmanifest-<algorithm>.txt file, which includes the hash values necessary to perform a fixity check on metadata/tag files (manifest-<algorithm>.txt only covers the /data directory contents).
  • DPN has added a dpn-tags sub-directory and defined a dpn-info.txt file with some DPN-specific info, such as metadata about the ingest node providing the material.
  • DPN allows additional tag directories that may be specific to individual nodes in DPN.
  • DPN requires the existence of some optional fields in bag-info.txt but allows their value to be empty.
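Putting the features above together, here is a rough sketch of what a DPN bag would contain. The UUID directory name and annotations are illustrative, and the exact contents of dpn-info.txt are not spelled out here:

```
<dpn-uuid>/                      top-level directory named with DPN's internal UUID
   bagit.txt
   bag-info.txt                  DPN-required fields present, values possibly empty
   manifest-sha256.txt           sha256 hashes for everything under data/
   tagmanifest-sha256.txt        hashes for the tag/metadata files
   dpn-tags/
      dpn-info.txt               DPN-specific info, e.g. about the ingest node
   data/
      ... payload files ...
```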

The structure of DPN bags suggests a few basic potential approaches to integration:

  • the API could be used to submit datafiles and dataset metadata to DPN, which would build a zipped bag. To be useful, this approach would have to allow submission of additional files, such as the ORE map and pid-mapping.txt files discussed above.
  • the API could be used to submit the entire Bag structure (unzipped), causing DPN to create a Bag around the unzipped Bag provided. This adds a layer of indirection, but would keep the metadata and datafiles visible in DPN (in comparison with the next options).
  • a zipped Bag generated in Dataverse could be augmented with DPN-specific information and submitted as a direct replacement for the Bag DPN would have generated via GUI submission of datafiles. For this to work, DPN would have to accept such Bags (which I've heard is true but have not verified), and we would have to be able to retrieve the ingest node metadata required for the Bag or to treat Dataverse as an ingest node.
  • a zipped Bag could be submitted as the sole content file to be added to a DPN-generated DPN bag. This would result in a layer of indirection/wrapping but would minimize the coupling between the systems.

There are some actions, such as implementing sha256 hash values as an option in DV and using them for projects that wish to submit to DPN, that would probably be worthwhile for future validation efforts regardless of which option is pursued. So far, we have not tried to implement any of these options.

Feedback Welcome

QDR is interested in creating a robust solution that meets its needs while hopefully providing a general solution that could help others in the DV community. Toward that end, we're very interested in getting feedback on the conceptualization as well as on the practical aspects of implementation described here and evident in the proof-of-concept ORE and Bag implementations available in the repository.

Initial feedback is being requested via a post to the Dataverse Community. Our intent is to follow up by email and/or specific issues on GitHub.

## Updates since the initial proof-of-concept

  • Use of namespaces for ORE, DCterms, schema.org, and 'core' Dataverse terms, as well as for metadata block elements
  • Generation of sha256 hashes during bagging
  • Inclusion of Dataset- and Datafile-level terms of use/access (all possible entries from Dataverse forms)
  • Update to BagIt 1.0 from 0.97
  • Calculation of basic stats during bagging, including total data size, number of data files, and a list of data file mimetypes (added to the oremap; total size is still reported in the bag-info file)
  • Use of static terms and translations to simplify updating the mapping of terms during development and potentially to allow dynamic configuration of mappings
  • Minor bug fixes (e.g. use of relative paths in manifest-*.txt and pid-mapping files, a Dataverse bug where Datafield.isEmpty(false) was not working) and cleanup (e.g. don't use arrays for single values, retrieve the source organization name, address, and email from the Bundle instead of hard-coding, add contact info to bag-info.txt).

Sample files (updated 9/26/2018):