Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

File metadata going to DataCite is not escaped #5546

Closed
qqmyers opened this issue Feb 14, 2019 · 6 comments
Closed

File metadata going to DataCite is not escaped #5546

qqmyers opened this issue Feb 14, 2019 · 6 comments
Assignees

Comments

@qqmyers
Copy link
Member

qqmyers commented Feb 14, 2019

In developing #5506, which compares the XML returned from DataCite with what DV sends, we discovered that, unlike dataset metadata, file metadata (label, description are the only things sent to DataCite currently) are not escaped when sent to DataCite. So, for example, the Title "A & B" when sent to DataCite is currently stripped to read "A B" and only if it is escaped to read ("A & B") does DataCite capture the ampersand (correctly displaying it and returning the escaped title to DV so that the returned XML matches what was sent.

Long-term, it might make sense to escape the values being returned by DataFile.getDescription() and DataFile.getDisplayName() (overrides DvObject.getDisplayName()) and adjusting all the places that this information is viewed - to be more parallel to how Dataset works, but, for now, just escaping in the DataCite code is enough to address the issue. I'll submit a PR with that change...

@pdurbin
Copy link
Member

pdurbin commented Feb 14, 2019

@qqmyers I approved pull request #5548 and I'm wondering how related this is to #3328 and if you feel like making a pull request for that issue too. 😄

@qqmyers
Copy link
Member Author

qqmyers commented Feb 14, 2019

@pdurbin - it's tempting but... FWIW - it looks like the dataset metadata is html escaped which handles things like a bare ampersand (which is more than what happens for file metadata...) but really what's needed is the same StringEscapeUtils.escapeXml() call I'm making here, which I think handles anything that would make invalid xml such as   or <>, etc. (just assuming, haven't checked). If that's the case, to cover #3328 (without just restricting what users can enter), I think you'd have to scan through DOIDataCiteRegisterService.generateXML() and escape each string before adding it to the template.

@kcondon kcondon self-assigned this Feb 14, 2019
@kcondon
Copy link
Contributor

kcondon commented Feb 14, 2019

@qqmyers Dumb question but if I put A & B in file description and fetch the xml from Datacite when your fix is in place, I see

<description descriptionType="Abstract">A &amp; B</description>

Was that your intention? I see it is escaped in the xml

Thanks for any info.

Also cannot publish a file at all with EZID if set file descp as A & B, even with escaped code.

@qqmyers
Copy link
Member Author

qqmyers commented Feb 14, 2019

@kcondon - yes. If you look in Fabrica, they display without the escaping:
image
where the underlying XML is
<description descriptionType="Abstract">test &amp; 2</description>

(without the escaping, they completely drop the &)

@djbrooke
Copy link
Contributor

@kcondon flagged in standup that this fix does not work for EZID. I said it was OK to merge this as it provides improvement for the large number of installations using DataCite.

@qqmyers
Copy link
Member Author

qqmyers commented Feb 15, 2019

Yeah - I haven't looked at what is sent in EZID, but the formatting of the metadata is Provider/Provider API specific. If the metadata values could be made neutral-enough (no special characters w.r.t. xml, json, any format used by providers) a change outside the Provider classes might work, but if special characters are allowed, I think in general it would have to be the Provider classes handling the escaping/cleaning for their particular APIs/transfer formats.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants