Align or merge DataCite metadata exports #5889

Open
jggautier opened this issue May 28, 2019 · 37 comments · Fixed by #10632
Assignees
Labels
Feature: Harvesting Feature: Metadata GREI 2 Consistent Metadata NIH OTA: 1.5.1 collection: 5 | 1.5.1 | Standardize download metrics for the Harvard Dataverse repository... pm.GREI-d-1.5.1 NIH, yr1, aim5, task1: Standardize download metrics pm.GREI-d-1.5.2 NIH, yr1, aim5, task2: WG with other repositories to follow Make Data Count recommendations Size: 30 A percentage of a sprint. 21 hours. (formerly size:33) Type: Suggestion an idea

Comments

@jggautier
Contributor

jggautier commented May 28, 2019

This issue is meant to record the differences between Dataverse's two newest metadata exports as of v4.14, "DataCite"/"Datacite" and "OpenAIRE"/"oai_datacite", and discussion about how to align (or possibly merge) the very similar exports.

As part of v4.10 (released in Dec. 2018), Dataverse makes available through the UI, API and over OAI-PMH dataset metadata in the DataCite schema (#5043). This lets Dataverse export dataset metadata in a widely-used, discipline-agnostic schema that's more standardized than Schema.org and has more metadata than Dublin Core.

As part of v4.14 (released in May 2019), Dataverse makes available through the UI, API and over OAI-PMH DataCite metadata that complies with OpenAIRE requirements (#4257). Repositories need to follow these requirements in order for their dataset metadata to be made discoverable (harvested) by OpenAIRE (OpenAIRE EXPLORE). The OpenAIRE metadata requirements follow the DataCite schema, with some differences between OpenAIRE and DataCite listed in their documentation.

What both exports are called depending on the export method:

[Image: openaire-datacite-export-graphic]

Both metadata exports are based on DataCite 4 and are meant to be valid against the DataCite 4 schema (although the xml records available over OAI-PMH in "Datacite" format reference DataCite's 3.1 schema). But Dataverse exports them as separate formats for several reasons:

  • The two metadata exports were worked on at different times by different groups
  • When work on making Dataverse OpenAIRE compliant started, I thought the OpenAIRE export would follow the DataCite 3.1 schema, since the OpenAIRE guidelines for data repositories follow DataCite 3.1. And I knew that Dataverse would eventually export DataCite 4 metadata, so it made sense to make them separate exports. But we're told the OpenAIRE folks plan to update their guidelines, so our 4Science colleagues created the OpenAIRE export following the DataCite 4 schema. (For example, a notable difference between DataCite 3 and 4 is how funder information is handled. The OpenAIRE guidelines mandate that the contributorType property is used, which is how DataCite 3 handles funder info. But Dataverse's OpenAIRE export uses the DataCite 4 fundingReferences property instead.)
  • The "OpenAIRE" metadata export uses an algorithm that adds metadata about whether dataset authors and contact persons are people or organizations (in DataCite's nameType attribute). The algorithm was the last thing discussed in the OpenAIRE GitHub issue.
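To make the funder difference mentioned in the list above concrete, here is a minimal Python sketch that builds both shapes with the standard library. The element and attribute names (contributor with contributorType="Funder", contributorName, fundingReferences, fundingReference, funderName, awardNumber) come from the DataCite schemas. The function names and sample values are made up for illustration, XML namespaces are left out, and this is not how Dataverse's exporters are written.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def funder_datacite3(name: str) -> ET.Element:
    """DataCite 3 style: the funder is a contributor with contributorType="Funder"."""
    contributors = ET.Element("contributors")
    contributor = ET.SubElement(contributors, "contributor", contributorType="Funder")
    ET.SubElement(contributor, "contributorName").text = name
    return contributors


def funder_datacite4(name: str, award_number: Optional[str] = None) -> ET.Element:
    """DataCite 4 style: the funder gets its own fundingReference element."""
    references = ET.Element("fundingReferences")
    reference = ET.SubElement(references, "fundingReference")
    ET.SubElement(reference, "funderName").text = name
    if award_number:
        # awardNumber is optional, so it is only written when a value exists.
        ET.SubElement(reference, "awardNumber").text = award_number
    return references


if __name__ == "__main__":
    # "Example Funder" and "ABC-123" are placeholder values.
    print(ET.tostring(funder_datacite3("Example Funder"), encoding="unicode"))
    print(ET.tostring(funder_datacite4("Example Funder", "ABC-123"), encoding="unicode"))
```

A harvester that only looks for contributorType="Funder" would find nothing in the second shape, which is why the guideline version matters here.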

Ideally, Dataverse would export only one metadata record, made available through the UI, API and over OAI-PMH, that follows the DataCite schema and is also OpenAIRE compliant. As things are now, Dataverse exports two metadata records that are both based on DataCite but differ from each other, and people have been confused about the differences between the two exports, which are called "DataCite" and "OpenAIRE" in the UI and "Datacite" and "oai_datacite" in the API endpoints and over OAI-PMH.

But we may want to maintain two metadata exports because:

  • the OpenAIRE export is using the nameType algorithm, which was tested during QA but only tested for evidence that the algorithm would work in at least some cases. We haven't tried to estimate how often it will correctly figure out if author/contact names of actual datasets are people or organizations (although it's based on an algorithm DataCite uses that we're told is right over 90% of the time). Would people want to be able to export or harvest metadata that does not include the nameType metadata (maybe because they find that it's not correct often enough)?
  • the OpenAIRE export uses one of four mandatory Access Rights terms. The rules that Dataverse uses to determine this are discussed in a GitHub issue comment. But I realized recently that the rules are too simple and lead to cases where datasets are marked as closedAccess when restricted access is more appropriate (e.g. https://doi.org/10.7910/DVN/0PMZC6, where file request is disabled, but people can request access through a process that happens outside of Dataverse). A GitHub issue about this has been opened (Access Rights metadata in OpenAIRE metadata export is being misapplied #5920), so we can figure out how to assign more appropriate access rights to datasets. Until then, would people want to be able to export or harvest metadata that does not include these sometimes misleading Access Rights?

We should decide if:

  • Dataverse should maintain one export or two and
  • If maintaining only one export, make sure that it has all of the metadata available in the current two exports.
  • If maintaining two exports, make sure that the amount of metadata in one export is as close to the same amount in the other (and continues to be as synced as possible) and document what the differences are. (As of v4.14 the "OpenAIRE" export has more metadata than the "DataCite" export but there are things missing in both.)
@mheppler
Contributor

Related? Silent publishing failure when not all fields required by Datacite are present #7551

@jggautier
Contributor Author

jggautier commented Feb 10, 2021

Good point. It could be related if/when Dataverse repositories start sending more metadata to DataCite and the dependencies among the child fields of any of that metadata are the same as the dependencies among the child fields of the Producer compound field (which right now is the only field causing those silent failures).

@adam3smith
Contributor

@qqmyers and I are also looking at this given that what we're currently sending to DataCite is indeed rather inadequate.
Looking at the Crosswalk Julian put together, it seems to me that the current OpenAIRE export is strictly better. The only field I'm seeing where DataCite has something and OpenAIRE doesn't is Name Identifier schemeURI, and that's either just not documented or an oversight that should be fixed.

I'm not at all concerned about the naming algorithm. If anything, I think it's a good idea to try to guess organizational names.
I think the closed vs. restricted data categorization is something that should get addressed, but I don't see it as a blocker.

Given this, I think a single export format makes sense.

In terms of items missing from both exports, the citation metadata looks complete, but the individual subject blocks seem to have some stuff missing. From @philippconzett 's list at #7072 that's most notably the geography data, which we'd also like to capture.

We're viewing this as pretty high priority given how widely DataCite data are used (e.g. the fact that we're not linking up our funding information to the PID graph isn't great) -- is there anything we can do to help move this along?

@djbrooke
Contributor

djbrooke commented Mar 9, 2021

Thanks @adam3smith.

@jggautier if we bring this into the sprint starting tomorrow, would you have some time over the next two weeks to get into this with a developer that picks it up? If not, the sprint starting in two weeks? I know you're spread a bit thin right now with the UI/UX work starting up, so if it makes sense to wait two weeks I think that's fine. What do you think?

@poikilotherm
Contributor

poikilotherm commented Mar 9, 2021

Is my #7077 related here, too? (Going to work on that, you folks know... Funding...)

@djbrooke
Contributor

djbrooke commented Mar 9, 2021

@poikilotherm May be related, but I think we'd want to move these forward independently IMHO. I think much of the discussion around #7077 will happen as part of the Software Metadata WG.

@adam3smith
Contributor

Awesome! @jggautier -- I think you have this covered, but if there's anything you'd like another set of eyes on or a 2nd opinion just tag and/or email me.

@jggautier
Contributor Author

Thanks @adam3smith. Great to hear there's more interest in prioritizing this! I'm all on board with saving the closed vs. restricted data categorization problem (#5920) for another day if it moves this issue forward. I think there are a few other things we should consider:

  • Is there any reason why repositories wouldn't like the nameType algorithm? Is there a way to test how well it's been working generally and for certain types of names? @adam3smith or @qqmyers, would you happen to know how DataCite figured out that the algorithm they use works 90% of the time? Or should we include the nameType algorithm and later on, separate from this, figure out how well it's working?
  • When more metadata is sent to DataCite, we should make sure we don't run into the compound field dependency issues that the Producer metadata field had (discussed in Silent publishing failure when not all fields required by Datacite are present #7551). (For example, the OpenAIRE export deals with missing Related Publication metadata by including it only if certain fields are filled.)
  • The OpenAIRE export uses the IsCitedBy relationship when including metadata from the Related Publication field. We never really resolved how to use DataCite's relation terms (discussed in Add "Relation Type" to related publication metadata fields to send DataCite related publication metadata #2778). I think we could:
    • Work out how to allow depositors to define different types of relationships between their datasets and related text-based publications (like articles) and/or make it easier for repositories to choose what types of relationships they want their depositors to use. This might involve UI changes.
    • Decide with the Dataverse community which one relation term to use and expect people and other systems (harvesters, indexers, etc) to interpret that term very broadly (like "this publication is somehow related to this dataset"). Then I think this term could logically be applied to the Related Publication metadata in datasets that Dataverse repositories have already published.
    • Decide with the Dataverse community to use one term that we define more narrowly (like "this dataset is cited by this publication"). But does it make sense to apply that term to the metadata of existing datasets? Not all repositories know what types of relationships their depositors had in mind when entering Related Publication metadata. I'd guess a majority of the time, the dataset is used to support findings/conclusions made in an article, but the article may not be citing the dataset. Could there be other reasons why a dataset is associated with something like a journal article? And will people and other systems ever care about/rely on the differences between the relationship types? (I think for MakeDataCount, the answer right now is no: when citations are counted, any one of several types of relation terms are valid because repositories are using the terms in different ways, so the standard's designers don't want to be too strict about which relation term or terms signal a "citation".)
    • Not include Related Publication metadata in this new, merged DataCite metadata export and tackle Add "Relation Type" to related publication metadata fields to send DataCite related publication metadata #2778 separately.

@adam3smith
Contributor

Thanks Julian.

@mfenner

mfenner commented Mar 10, 2021

Users can set Personal or Organizational authors via nameType. Otherwise DataCite is doing the following:

  • if there is an ORCID associated with the author, it is a person
  • if there is a givenName, it is a person
  • if the creatorName has something that looks like a givenName, and that givenName is in a dictionary of known given names (using https://github.com/berkmancenter/namae), it is a person. This is where the 90% comes from. The dictionary is not so good in non-European names, and there are organization names that contain a given name (e.g. "Alfred P. Sloan Foundation").
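The three rules above can be sketched roughly as follows. This is only an illustration in Python, not DataCite's or Dataverse's code: the tiny name set stands in for the real dictionary, the name splitting is a naive stand-in for what the namae library does, and all function names are hypothetical. The list above does not say what happens when no rule matches, so this sketch returns None ("unknown") in that case.

```python
from typing import Optional

# Tiny stand-in for a dictionary of known given names (hypothetical sample data).
KNOWN_GIVEN_NAMES = {"alfred", "julian", "maria", "sebastian"}


def candidate_given_name(creator_name: str) -> Optional[str]:
    """Naively pick the token that might be a given name.

    "Family, Given" -> first token after the comma.
    "Given Family"  -> first token.
    A single-token name yields no candidate.
    """
    if "," in creator_name:
        after_comma = creator_name.split(",", 1)[1].split()
        return after_comma[0] if after_comma else None
    tokens = creator_name.split()
    return tokens[0] if len(tokens) > 1 else None


def guess_name_type(creator_name: str,
                    orcid: Optional[str] = None,
                    given_name: Optional[str] = None) -> Optional[str]:
    """Return "Personal" when one of the three rules matches, else None."""
    if orcid:        # rule 1: an ORCID is associated with the author
        return "Personal"
    if given_name:   # rule 2: an explicit givenName was supplied
        return "Personal"
    candidate = candidate_given_name(creator_name)
    if candidate and candidate.lower() in KNOWN_GIVEN_NAMES:
        return "Personal"  # rule 3: dictionary lookup
    return None  # assumption: no rule matched, so no guess is made


if __name__ == "__main__":
    for name in ["Gautier, Julian", "Alfred P. Sloan Foundation",
                 "Mao Zedong", "Harvard University"]:
        print(name, "->", guess_name_type(name))
```

With this sample dictionary, "Alfred P. Sloan Foundation" is guessed to be a person and "Mao Zedong" gets no guess at all, which are the two weaknesses described in the third rule.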

@adam3smith
Contributor

Thanks! Dataverse currently doesn't have a nameType option, which is why we need some sort of algorithmic solution to determine this.

  • The ORCID option makes sense
  • Since Dataverse doesn't have separate given/family name fields, I'm guessing the option here is to use the presence of a comma as a heuristic (that's what Zotero would do on import, and it generally works pretty well). The problem is that this will have a fair number of false positives with non-Western names, as it's common to enter names without a comma and often in familyname/givenname order (e.g., Mao Zedong).

Since the name list also sounds like it works less well for non-Western names, I'd actually now be somewhat nervous about this. Do you have contacts at some of the Chinese DV installations we could ask or are there Dataverse Collections at Harvard more likely to contain non-Western creator names so we could check?

If this is indeed fairly common, labeling a significant number of people with non-Western names as institutions seems a lot more problematic than the reverse and I'd go back on my opinion above...

@mfenner

mfenner commented Mar 10, 2021

The presence of a comma is unfortunately not a good heuristic for DataCite, as many repositories use "givenName familyName", instead of "familyName, givenName".

The best solution is really using givenName and familyName. The reason we use a name dictionary is mainly that adoption of givenName/familyName is too low.

@adam3smith
Contributor

adam3smith commented Mar 10, 2021

Just to be clear -- what we're after here is not to change what Datacite does but what Dataverse does in creating metadata submitted to Datacite -- Datacite just comes in because the Dataverse algorithm for handling names is derived from your code.

I think adding separate name fields would be quite challenging at this point, though I agree that it'd be much preferable.

@mfenner

mfenner commented Mar 10, 2021

I understand. One important reason for "guessing" personal names is citation styles and formatted citations (as you of course know). DataCite introduced givenName and familyName a few years ago and it is still optional as it is indeed challenging to implement.

@jggautier
Contributor Author

jggautier commented Mar 10, 2021

Thanks @mfenner as always!

@adam3smith, there was a lot of discussion in #4257 about figuring out the nameType and adapting DataCite's algorithm to address failure cases discovered during QA. I think the summary at #4257 (comment) still holds, and it includes looking for an ORCID, but I don't think we looked at how well it works for non-Western names. I think we could contact folks from installations where non-Western names are common, possibly ones running Dataverse 4.14+, and could look at Dataverse Collections at Harvard that are more likely to contain non-Western creator names.

Maybe the outcome of this investigation would be to figure out whether or not we need to make it possible/easier for installations to turn off the nameType algorithm for the DataCite export. @adam3smith, @qqmyers, @djbrooke. How does that sound? And work to figure out how Dataverse repositories can better determine nameType can be done as part of another issue?

@adam3smith wrote:

There are a number of the fields that only make sense as conditionals (e.g. all the scheme/identifier fields). The solution described in #7606 looks good to me and would appear to solve this and seems to be scheduled to land in 5.4?

I agree and spoke with @scolapasta about the use cases and limits of #7606. My understanding is that it wouldn't address cases like the Related Publication field. @scolapasta could confirm, but from what I understand #7606 wouldn't let repositories say that if the ID Type is filled, the ID Number must also be filled (or vice versa), because that compound field also has two other fields, "Citation" and "URL", which I think should remain optional for the purposes of exporting metadata in the DataCite schema.

The code for the OpenAIRE export already handles Related Publication in a different way, only including that metadata if both ID Type and ID Number are filled (instead of taking the approach of #7606 to prompt depositors to enter the metadata the way the software/installation admins expect). I'm not sure if there are other fields to consider, but I don't think looking out for these cases will make this issue take any longer to work on.
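As a rough illustration of the "only include it when both parts are filled" approach described above, here is a minimal Python sketch. It is not the actual exporter code: the function name and its parameters are hypothetical, the mapping from Dataverse's ID Type values to DataCite's relatedIdentifierType vocabulary is skipped, and IsCitedBy is used as the default only because that is the relation this thread says the OpenAIRE export uses.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def related_identifier(id_type: Optional[str],
                       id_number: Optional[str],
                       relation_type: str = "IsCitedBy") -> Optional[ET.Element]:
    """Build a relatedIdentifier element, or return None if either part is missing."""
    id_type = (id_type or "").strip()
    id_number = (id_number or "").strip()
    if not id_type or not id_number:
        # Incomplete compound field: leave it out of the export rather than
        # emit an element that would not validate.
        return None
    element = ET.Element("relatedIdentifier",
                         relatedIdentifierType=id_type,
                         relationType=relation_type)
    element.text = id_number
    return element


if __name__ == "__main__":
    # "10.1234/abcd" is a placeholder identifier.
    complete = related_identifier("DOI", "10.1234/abcd")
    print(ET.tostring(complete, encoding="unicode"))
    print(related_identifier("DOI", ""))  # skipped because ID Number is empty
```

The trade-off is that an incomplete entry is silently left out of the export, whereas the #7606 approach would prompt the depositor to complete it.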

@djbrooke wrote:

@jggautier if we bring this into the sprint starting tomorrow, would you have some time over the next two weeks to get into this with a developer that picks it up? If not, the sprint starting in two weeks? I know you're spread a bit thin right now with the UI/UX work starting up, so if it makes sense to wait two weeks I think that's fine. What do you think?

Based on all of this I'm thinking two things should be done before implementation work starts, and I'd have time in the next two weeks to help do them:

  • a review of the metadata mapping. Like @adam3smith wrote, that shouldn't be too much trouble
  • a look into how well the nameType algorithm is working for non-Western creator names and if installations need a way to turn the algorithm off (and not include in the DataCite export a guess about if a creator is a person or organization)

Then maybe we could aim for working on implementation in the following sprint? What do you all think?

@mfenner

mfenner commented Mar 10, 2021

One small comment: the author of the library we use for names (https://github.com/berkmancenter/namae) is @inukshuk who @adam3smith knows from citationstyles work, maybe it is worth reaching out to him, e.g. to ask about handling of non-Western names.

@qqmyers
Member

qqmyers commented Mar 10, 2021 via email

@adam3smith
Contributor

  • a review of the metadata mapping. Like @adam3smith wrote, that shouldn't be too much trouble
  • a look into how well the nameType algorithm is working for non-Western creator names and if installations need a way to turn the algorithm off (and not include in the DataCite export a guess about if a creator is a person or organization)

Then maybe we could aim for working on implementation in the following sprint? What do you all think?

That sounds good to me.

It might be simple to add a person/org choice field and just use ‘the algorithm’ to pre-populate that for existing data, i.e. we only use it to handle legacy info rather than in an ongoing way.

We'd be happy with this -- the more control we have over metadata the better -- but there may be concern about too many UI elements for self-deposit repositories.

@abollini
Contributor

Sorry for joining the discussion so late. I just want to add a reference to the in-progress update to the OpenAIRE DataArchive guidelines, which will be based on version 4 of the Datacite schema: https://openaire-guidelines-for-data-archive-managers.readthedocs.io/en/latest/index.html

This is essentially the new version of the guidelines that we were asked to develop for in 2018 (to be more specific, at that time we looked at the Datacite schema v4.1), and that work was contributed to Dataverse in 4.14.

The OpenAIRE team is still working on the new version. I'm taking the liberty of pinging them in the thread openaire/guidelines-data-archives#2 so that they are aware of the work in progress in the Dataverse community.

@jggautier
Contributor Author

Hi @abollini. I don't think you're late at all. The status of this issue was brought up in a recent Dataverse community meeting, so I thought it would be helpful to write here that the plans being discussed in this GitHub issue for how to proceed haven't been started or finalized. I think it's great that the OpenAIRE team will be aware of this discussion. Thanks!

@jggautier
Contributor Author

jggautier commented Feb 24, 2022

Just noticed that in the DataCite exports of installations running Dataverse software v5.9, and maybe all earlier versions, parentheses are added to the Author Affiliation values that are put in DataCite's creator > affiliation element:

[Screenshot: Screen Shot 2022-02-24 at 12 42 02 PM]

The screenshot is from an export from Demo Dataverse, running v5.9. It's also done in this export from DataverseNL (v5.9)

Maybe this is because the code is getting what's displayed on the dataset page instead of what's entered in the field on the edit metadata page? Looks like that was the issue when Author Affiliation values were wrapped in parentheses in the search API results (#6570 (comment))

The OpenAIRE export doesn't include the parentheses, so I mention this bug in this issue since it seems natural that merging these two exports, or aligning them more, would also fix this parentheses bug.

@cmbz cmbz added the Size: 30 A percentage of a sprint. 21 hours. (formerly size:33) label Apr 4, 2024
@jggautier
Contributor Author

jggautier commented Apr 5, 2024

Just an update about what I wrote last week about consulting @abollini about the related comments he left in a GitHub issue at openaire/guidelines-data-archives#2.

In that issue I commented to let @abollini know about this proposal to merge the two exports, asked for feedback about having the merged export's schemaLocation point to the xsd of DataCite's 4.5 schema, and asked for more information about bringing "arguments from the OpenAIRE team" to this effort.

@adam3smith
Contributor

Thanks Julian -- we'd be very happy to see this merged and I think it'd have significant downstream benefits to improve the Dataverse-deposited metadata with Datacite this way.

@poikilotherm
Contributor

I'm still having this crazy idea about generating model classes from the Schema XSDs and creating mappers from our internal metadata model to the target model...

@jggautier
Contributor Author

Hey @poikilotherm, would this be a better way to change the exports? Would it take a lot of time to do?

@sbarbosadataverse

sbarbosadataverse commented Apr 9, 2024

Ceilyn and Sonia prioritized this and moved it to sprint ready as part of GREI Y3 planning. @jggautier @scolapasta Please weigh in if you have objections.

poikilotherm added a commit that referenced this issue Apr 10, 2024
- Generate model classes for DataCite 4.5 Metadata Schema
- Add a simple test to demonstrate usage and basic validity.
@poikilotherm
Contributor

poikilotherm commented Apr 10, 2024

@jggautier I put together a very simple demonstrator for the generator part, using the DataCite 4.5 Kernel. (It does not include the mapper part, where we map our internal to the generated model. I could create an example exporter for that if you want.) To run the example, use this:

git clone --branch 5889-gen-schema-pojos https://github.com/IQSS/dataverse.git dataverse
cd dataverse
mvn -f modules/dataverse-schemas package

Aside from that, here's the comparison: https://github.com/IQSS/dataverse/compare/5889-gen-schema-pojos

@jggautier
Contributor Author

Thanks @sbarbosadataverse. I don't have any objections to this being prioritized and moved to sprint ready. I'm worried we won't hear back from folks from OpenAIRE by the end of the sprint next Wednesday. I'll reach out to @abollini again in openaire/guidelines-data-archives#2

@poikilotherm I'm hesitant to try to better understand what generators are. But could you write about the benefits? For example, does it make it easier to change the exporters?

@poikilotherm
Contributor

poikilotherm commented Apr 15, 2024

Currently, for DataCite we use a template approach, combined with XML processing. For DDI we use AFAIK an XML only processing approach. For our JSON based exports we use mostly JSON processing.

The point is: all of this is hand crafted. The implementation is done by us and we need to make sure the serialized output matches the specifications involved. We also provide the mapping from our internal model to the target model with these serializers.

When using generators, parts of the process are turned upside down. You start with the spec (XML XSD, JSON Schema, OpenAPI...) and you use a tool to generate model classes out of these.

The result is a set of classes that can be serialized to the target output format using the data binding mechanisms included in the Jakarta standards. Beyond that, these classes can also be used for the inverse process: deserialization from some data to the model. An example would be importing DataCite XML from OAI-PMH: use the data binding to get a populated Java model of the data.

As the model classes are generated from the spec, they are known to cover all of the spec. We might not use all of the available modeling, but at least we can easily extend without much hassle.

As long as the generator tools don't make mistakes, the data binding will always produce valid output data and will always map correct input data back to the model.
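To give a feel for the round trip described above (model object to XML and back), here is a small Python sketch. It is only an analogy: in the proposal the classes would be generated from the DataCite XSD and bound with Jakarta's data binding in Java, whereas the two classes below are hand-written stand-ins covering a tiny, namespace-free slice of the schema. The class names, field names and sample values are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List
import xml.etree.ElementTree as ET


@dataclass
class Creator:
    creator_name: str
    name_type: str = ""  # "Personal", "Organizational", or "" when unknown


@dataclass
class Resource:
    identifier: str
    creators: List[Creator] = field(default_factory=list)

    def to_xml(self) -> str:
        """Serialize the model to a (namespace-free) DataCite-like XML string."""
        root = ET.Element("resource")
        ident = ET.SubElement(root, "identifier", identifierType="DOI")
        ident.text = self.identifier
        creators_el = ET.SubElement(root, "creators")
        for creator in self.creators:
            creator_el = ET.SubElement(creators_el, "creator")
            name_el = ET.SubElement(creator_el, "creatorName")
            name_el.text = creator.creator_name
            if creator.name_type:
                name_el.set("nameType", creator.name_type)
        return ET.tostring(root, encoding="unicode")

    @classmethod
    def from_xml(cls, xml_text: str) -> "Resource":
        """Deserialize the same XML shape back into the model."""
        root = ET.fromstring(xml_text)
        creators = [
            Creator(name_el.text or "", name_el.get("nameType", ""))
            for name_el in root.findall("creators/creator/creatorName")
        ]
        return cls(identifier=root.findtext("identifier", default=""),
                   creators=creators)


if __name__ == "__main__":
    # Placeholder DOI and names, not real records.
    original = Resource("10.1234/EXAMPLE",
                        [Creator("Gautier, Julian", "Personal"),
                         Creator("Example Lab")])
    xml_text = original.to_xml()
    print(xml_text)
    print(Resource.from_xml(xml_text) == original)
```

The point of generating such classes instead of writing them by hand is that the to/from logic and the field list would come from the schema itself, so they could not drift apart the way hand-written serializers can.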

Using our own implementations for de-/serialization requires extensive testing and also a lot of manual work to implement every change etc.

The availability of schemas and model classes for them allows much stricter enforcement of data validity at compile time and runtime. Constraints about the data from the spec are transported into the data model, allowing for simpler interaction with the model from code as well as the Java compiler assisting you in building it.
Example: most generators will allow you to create a Fluent API for the model.

For the exporters, having schemas around (and I'm talking about more than just DataCite) will also allow for a clearer defined data exchange between the core application and plugged in exporters. The model classes provide Data Transfer Objects as a side product.

Also, upgrading schemas is improved. We can include a generated data model version for any version of a schema. If we want to change the supported schema version, the Java code can help us determine what to change and how. It's much clearer in code what is supported and what isn't. Changing a version means changing the import path for the classes.

Brain dump out.

@jggautier
Contributor Author

@cmbz asked me to add a status update to this GitHub issue. There's discussion and related work in the pull request at #10615 that addresses at least some of what's been proposed in this GitHub issue.

@DS-INRA
Member

DS-INRA commented Jul 22, 2024

Another related issue :

@cmbz

cmbz commented Aug 20, 2024

To focus on the most important features and bugs, we are closing issues created before 2020 (version 5.0) that are not new feature requests with the label 'Type: Feature'.

If you created this issue and you feel the team should revisit this decision, please reopen the issue and leave a comment.

@cmbz cmbz closed this as completed Aug 20, 2024
@pdurbin
Member

pdurbin commented Aug 20, 2024

This issue has an open PR...

... so I'm reopening it. It'll be closed when we merge it.

@pdurbin
Member

pdurbin commented Sep 17, 2024

We're now using this PR instead to close this issue:

@jggautier
Contributor Author

Thanks for the heads up @pdurbin. I'm going to keep this issue open, or I guess re-open it after that PR is merged, so that I can see what decisions were made and what goals and questions aren't addressed yet.

@pdurbin
Member

pdurbin commented Sep 18, 2024

@jggautier sounds good. Perhaps we can create a new issue with any remaining items.

@pdurbin pdurbin added this to the 6.4 milestone Sep 23, 2024
@jggautier jggautier reopened this Sep 24, 2024
@pdurbin pdurbin removed this from the 6.4 milestone Sep 25, 2024
Projects
Status: Interested
Status: Done