Exclude others namespace from harvesting "oai_dc" metadata prefix #10837

Draft
wants to merge 4 commits into
base: develop

jeromeroucou
Contributor

What this PR does / why we need it:

This PR allows harvesting from repositories that expose metadata in additional namespaces.

Some repositories extend "oai_dc" with elements from other namespaces. For example, SEANOE exposes some metadata in the dct namespace. Below is the result of https://www.seanoe.org/oai/OAIHandler?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:seanoe.org:41307

<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:dct="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
    <responseDate>2024-09-12T12:09:21Z</responseDate>
    <request verb="GetRecord" metadataPrefix="oai_dc" identifier="oai:seanoe.org:41307">
        https://www.seanoe.org/oai/OAIHandler</request>
    <GetRecord>
        <record>
            <header>
                <identifier>oai:seanoe.org:41307</identifier>
                <datestamp>2021-05-12</datestamp>
                <setSpec>GROUP:EMSO</setSpec>
                <setSpec>ec_fundedresources</setSpec>
            </header>
            <metadata>
                <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                    xmlns:dc="http://purl.org/dc/elements/1.1/"
                    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                    xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
                    <dc:title>Iridium GPS 1 data from the EMSO-Azores observatory, 2014-2015</dc:title>
                    <dc:creator>Legrand, Julien</dc:creator>
                    <dc:creator>Sarradin, Pierre-marie</dc:creator>
                    <dc:creator>Cannat, Mathilde</dc:creator>
                    <dc:subject>Mid-Atlantic Ridge</dc:subject>
                    <dc:subject>EMSO</dc:subject>
                    <dc:subject>Lucky Strike</dc:subject>
                    <dc:subject>Time-series</dc:subject>
                    <dc:subject>Environmental monitoring node</dc:subject>
                    <dc:subject>MoMAR</dc:subject>
                    <dc:subject>BOREL</dc:subject>
                    <dc:subject>GPS</dc:subject>
                    <dc:subject>Position</dc:subject>
                    <dc:description>This dataset contains the GPS positions of the EMSO-Azores
                        transmission buoy BOREL acquired between July 2014 and April 2015 using the
                        Iridium/GPS modem 1 (data acquired every 6 hours).</dc:description>
                    <dc:publisher>SEANOE</dc:publisher>
                    <dc:date>2015-10</dc:date>
                    <dc:type>dataset</dc:type>
                    <dc:identifier>DOI:10.17882/41307</dc:identifier>
                    <dc:identifier>https://doi.org/10.17882/41307</dc:identifier>
                    <dc:identifier>https://www.seanoe.org/data/00302/41307/</dc:identifier>
                    <dc:relation>info:eu-repo/grantAgreement/EC/FP7/312463/EU//FIXO3</dc:relation>
                    <dc:coverage>North 37.30134, South 37.2888, East -32.275618, West -32.27982</dc:coverage>
                    <dct:references>https://www.seanoe.org/data/00302/41307/</dct:references>
                    <dcterms:spatial xsi:type="DCTERMS:Box">37.2888 -32.27982 37.30134 -32.275618</dcterms:spatial>
                    <dc:rights>CC-BY</dc:rights>
                </oai_dc:dc>
            </metadata>
        </record>
    </GetRecord>
</OAI-PMH>

Currently, this record cannot be harvested because the following exception occurs:

Caused by: com.ctc.wstx.exc.WstxParsingException: Undeclared namespace prefix "dct"
  at [row,col {unknown-source}]: [5,555]
       at com.ctc.wstx.sr.StreamScanner.constructWfcException(StreamScanner.java:634)
       at com.ctc.wstx.sr.StreamScanner.throwParseError(StreamScanner.java:504)
       at com.ctc.wstx.sr.InputElementStack.resolveAndValidateElement(InputElementStack.java:503)
       at com.ctc.wstx.sr.BasicStreamReader.handleStartElem(BasicStreamReader.java:3066)
       at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2928)
       at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1122)
       at edu.harvard.iq.dataverse.api.imports.ImportGenericServiceBean.processXMLElement(ImportGenericServiceBean.java:209)
       at edu.harvard.iq.dataverse.api.imports.ImportGenericServiceBean.processOAIDCxml(ImportGenericServiceBean.java:180)
       ... 100 more
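
One plausible reading of this exception: in the response above, xmlns:dct and xmlns:dcterms are declared on the outer OAI-PMH element, not on oai_dc:dc. If only the oai_dc:dc fragment reaches the parser, the dct prefix is no longer bound. The sketch below reproduces that situation with the JDK's StAX parser rather than Woodstox; the class name, the method name, and the shortened fragment are illustrative assumptions, not code from this PR.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class UndeclaredPrefixDemo {

    // Hypothetical, shortened fragment: the <oai_dc:dc> element cut out of the
    // response, without the xmlns:dct declaration from the outer <OAI-PMH> element.
    static final String FRAGMENT =
        "<oai_dc:dc xmlns:oai_dc=\"http://www.openarchives.org/OAI/2.0/oai_dc/\""
        + " xmlns:dc=\"http://purl.org/dc/elements/1.1/\">"
        + "<dc:title>Iridium GPS 1 data</dc:title>"
        + "<dct:references>https://www.seanoe.org/data/00302/41307/</dct:references>"
        + "</oai_dc:dc>";

    /** Returns true if a namespace-aware StAX parse of the XML completes without error. */
    static boolean parsesCleanly(String xml) {
        // StAX factories are namespace-aware by default.
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                reader.next(); // fails on <dct:references> when dct is unbound
            }
            reader.close();
            return true;
        } catch (XMLStreamException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("fragment parses: " + parsesCleanly(FRAGMENT));
    }
}
```

Adding xmlns:dct="http://purl.org/dc/terms/" to the root element of the fragment makes the same parse succeed, which is consistent with the prefix binding being the problem rather than the element itself.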

We propose to ignore everything that is not in the dc namespace, which means skipping the WstxParsingException.
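
As a rough illustration of the idea (not the actual change in this PR), the sketch below reads a fragment with namespace processing turned off, so an unbound prefix no longer aborts the parse, and keeps only elements whose name starts with dc:. The class name, method names, and sample fragment are assumptions made for this example, and it uses the JDK's StAX implementation.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class DcOnlyExtractor {

    // Hypothetical, shortened fragment with unbound dct, dcterms and xsi prefixes.
    static final String SAMPLE =
        "<oai_dc:dc xmlns:oai_dc=\"http://www.openarchives.org/OAI/2.0/oai_dc/\""
        + " xmlns:dc=\"http://purl.org/dc/elements/1.1/\">"
        + "<dc:title>Iridium GPS 1 data</dc:title>"
        + "<dct:references>https://www.seanoe.org/data/00302/41307/</dct:references>"
        + "<dcterms:spatial xsi:type=\"DCTERMS:Box\">37.2888 -32.27982</dcterms:spatial>"
        + "<dc:rights>CC-BY</dc:rights>"
        + "</oai_dc:dc>";

    /** Rebuilds "prefix:local" whether or not the parser split the name. */
    static String qualifiedName(XMLStreamReader reader) {
        String prefix = reader.getPrefix();
        String local = reader.getLocalName();
        return (prefix == null || prefix.isEmpty()) ? local : prefix + ":" + local;
    }

    /** Collects the text of every dc:* element and ignores all other elements. */
    static Map<String, List<String>> extractDcFields(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // With namespace processing off, prefixes are treated as part of the name.
        factory.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, Boolean.FALSE);
        Map<String, List<String>> fields = new LinkedHashMap<>();
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    String name = qualifiedName(reader);
                    if (name.startsWith("dc:")) {
                        fields.computeIfAbsent(name, k -> new ArrayList<>())
                              .add(reader.getElementText().trim());
                    }
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not read oai_dc fragment", e);
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(extractDcFields(SAMPLE));
    }
}
```

Matching on the literal dc: prefix is a simplification: a server is free to bind the Dublin Core namespace to another prefix, so a more robust variant would compare namespace URIs once the missing declarations are available.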

Which issue(s) this PR closes:

No related issue found

Special notes for your reviewer:

Not really, but I have a suggestion for widening the scope of this pull request in a follow-up PR (or issue): the ForeignMetadataFormatMapping could be made more flexible and used for more namespaces than dcterms. With that, we could add a mapping for the dct namespace.
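
To make the suggestion concrete: in the SEANOE response, dct and dcterms are both bound to http://purl.org/dc/terms/, so a mapping keyed on the namespace URI rather than on the prefix would cover both. The sketch below is purely hypothetical; it does not use or reflect the real ForeignMetadataFormatMapping API, and the class, field, and method names are invented for illustration.

```java
import java.util.Map;

final class DcTermsPrefixAliases {

    static final String DCTERMS_URI = "http://purl.org/dc/terms/";

    // Hypothetical alias table: both prefixes seen in the SEANOE response
    // point at the same namespace URI.
    static final Map<String, String> PREFIX_TO_URI = Map.of(
        "dcterms", DCTERMS_URI,
        "dct", DCTERMS_URI);

    /** Returns true if the qualified name uses a prefix known to mean DC Terms. */
    static boolean isDcTerms(String qualifiedName) {
        int colon = qualifiedName.indexOf(':');
        if (colon < 0) {
            return false; // no prefix at all
        }
        String prefix = qualifiedName.substring(0, colon);
        return DCTERMS_URI.equals(PREFIX_TO_URI.get(prefix));
    }

    public static void main(String[] args) {
        System.out.println(isDcTerms("dct:references"));
        System.out.println(isDcTerms("dcterms:spatial"));
        System.out.println(isDcTerms("dc:title"));
    }
}
```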

Suggestions on how to test this:

Add a new harvesting client with the https://www.seanoe.org/oai/OAIHandler server and the GROUP:EMSO set.
Before this PR, all datasets fail with an error; with this PR, all datasets are imported.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

No

Is there a release notes update needed for this change?:

Todo

Additional documentation:

@coveralls

coveralls commented Sep 12, 2024

Coverage Status

coverage: 20.737% (-0.002%) from 20.739%
when pulling 34cf77d on Recherche-Data-Gouv:harvest_exclude_invalid_tag
into 4b96cec on IQSS:develop.

@qqmyers
Member

qqmyers commented Sep 12, 2024

FYI: You might want to look at/review #10836 which I think is doing something similar but more extensive.

@luddaniel
Contributor

> FYI: You might want to look at/review #10836 which I think is doing something similar but more extensive.

@qqmyers I'm not sure there is a link.

#10837 acts earlier, in the call dsDTO = importGenericService.processOAIDCxml(xmlToParse);, where XML namespaces can cause problems because of FastGetRecord XML truncation, the dc: prefix requirement, and a possible XMLStreamException. Also, a generic OAI archive can send customised oai_dc content, as in the example above.

If I missed something, could you shed some light on it for me?
