Unable to download files in Dataverse repositories when files don't have PIDs #354

jggautier · 2023-03-12T18:51:25Z

Description of the problem:

Hi! I was happy to hear that pooch has support for downloading files in repositories that use Dataverse. I played around with the commands a bit today.

The pooch.create and DOIDownloader functions work great when the files I want to download in Dataverse repositories have persistent identifiers. It looks like both functions assume that files in Dataverse repositories will have PIDs.

But many repositories using Dataverse don't register PIDs for their files (see a conversation in the Dataverse Google Group about which repositories do and don't). We can't assume that all files in Dataverse repositories will have PIDs, and it's likely that more repositories using Dataverse will "turn off" PID registration for their files.

So when I try to download a file that doesn't have a PID, I get an error.

Full code that generated the error

Using pooch.create:

data = pooch.create(
	base_url='doi:10.7910/DVN/DCDKZQ',
	path=pooch.os_cache('testing'))
data.load_registry_from_doi()
datafile = data.fetch(
	fname='Indata_2022.10.02_19.44.51.zip')

Using DOIDownloader:

downloader = DOIDownloader()
url = 'doi:10.7910/DVN/DCDKZQ/Indata_2022.10.02_19.44.51.zip'
downloader(
	url=url, 
	output_file='/testing/Indata_2022.10.02_19.44.51.zip',
	pooch=None)

Full error message

Traceback (most recent call last):
  File "/testing_pooch.py", line 28, in <module>
    downloader(
  File "/usr/local/lib/python3.9/site-packages/pooch/downloaders.py", line 611, in __call__
    downloader(download_url, output_file, pooch)
  File "/usr/local/lib/python3.9/site-packages/pooch/downloaders.py", line 207, in __call__
    response.raise_for_status()
  File "/usr/local/lib/python3.9/site-packages/requests/models.py", line 943, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=
[Finished in 2.9s with exit code 1]

System information

Operating system: Mac OS 13.2.1
Python installation (Anaconda, system, ETS): System (I think)
Version of Python: 3.9.12
Version of this package: v1.7.0

The text was updated successfully, but these errors were encountered:

jggautier · 2023-03-12T18:59:57Z

While not all files in Dataverse repositories have PIDs, they always have database IDs, so the Dataverse API endpoint for accessing files can use the file's database ID instead.

I think the download_url function at

pooch/pooch/downloaders.py

Line 1022 in a965902

def download_url(self, file_name):

tries to get a file's PID to create a download url for the file. And when the file doesn't have a PID, it creates a URL, e.g. https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=, that returns that 404 HTTPError.

The same API endpoint can also use the file's database ID, e.g. https://dataverse.harvard.edu/file.xhtml?fileId=6570505, where 6570505 is the database ID of the file "Indata_2022.10.02_19.44.51.zip".

So could that download_url always look for the file's database ID, instead, and use that to create the download_url?

santisoler · 2023-03-13T18:14:15Z

Hi @jggautier! Thanks for opening this issue.

I'll open a PR to fix this bug. I see your point and I do think we should add support for files that don't have a persistentID.

I'm going to download the file using the ID only if the persistentID is empty. This way we will still be using the persistentID for any file that provides it, and fallback to the file ID for these corner cases.

cc @dokempf

jggautier · 2023-03-13T18:22:09Z

That sounds great, thanks!

Just in case it's helpful, I'd like to clarify that I don't think it's a corner case that files in Dataverse repositories won't have persistentIDs. That is, when I checked in October, 49 of 70 repositories had files with no persistentIDs. And the Dataverse community has been talking about turning off file PID registration (it's on by default, and many repositories have had to turn it off due to cost and performance issue), so I think it's likely that more files will be published without file PIDs than with them.

dokempf · 2023-03-13T18:27:59Z

Thanks for the report @jggautier - I was implementing this based on my own DataVerse experience, which seems to be with an instance that has PIDs on files. I was not aware of the fact.

@santisoler I am also available for this if your too constrained - just ask. Do you think having the test dataset on an instance without PIDs would be a valueable addition to the testsuite? Maybe @jggautier could provide access to one.

santisoler · 2023-03-13T19:57:54Z

Thanks @dokempf for jumping in. I've already open an PR to fix this (#355), but it still needs some tests. Feel free to take over that PR.

Do you think having the test dataset on an instance without PIDs would be a valueable addition to the testsuite?

For sure, that would be excellent.

I'd like to clarify that I don't think it's a corner case that files in Dataverse repositories won't have persistentIDs.

Sorry, I misunderstood it. Thanks for the clarification.

I see that probably persistentIDs will become obsolete in the near future. But since the documentation of the Dataverse API offers the persistentID as the first option and mentions the id as an alternative, I would be inclined to keep the former as the first approach and fallback to the latter if the persistent id is empty (the changes in #355 reflect this). I think it's a conservative decision that lowers the chance of breaking backward compatibility, doesn't introduce a performance hit (it just needs to check if an string is empty) and actually solves this bug.

Let me know if there's any detail I'm not seeing. I'm actually quite new to Dataverse. Actually, I found out about it through @dokempf contributions to Pooch (so big thanks for that).

jggautier · 2023-03-13T20:27:28Z

Your point about what the API documentation implies makes sense to me. Thanks so much for the insight. I'd guess that documentation was written when the ability to register persistentIDs for files was added to Dataverse, and it was assumed that most files would be getting persistentIDs. It wasn't anticipated that repositories would need to turn off file persistentIDs. @pdurbin who's also at Dataverse and works on Dataverse APIs a lot would know more. :)

It's possible that in the future, different types of users, like depositors and curators, will have more control over which of their datasets in a repository do and don't get file persistentIDs. (And the documentation will be updated.)

You're approach sounds fine to me, although I'm no developer! :)

Thanks again!

leouieda · 2023-03-14T07:47:18Z

The only downside I can see to @santisoler's proposal is that for a majority of repositories we'll be hitting the API twice to get an ID for download. That will have a performance penalty since it will depend if the network and server speeds. Probably nothing major unless someone is downloading thousands of times.

dokempf · 2023-03-14T07:59:17Z

The only downside I can see to @santisoler's proposal is that for a majority of repositories we'll be hitting the API twice to get an ID for download. That will have a performance penalty since it will depend if the network and server speeds. Probably nothing major unless someone is downloading thousands of times.

We are actually hitting the API exactly once and cache the result of that. All the file information (PID or not) is in that one API response.

leouieda · 2023-03-14T10:19:51Z

We are actually hitting the API exactly once and cache the result of that. All the file information (PID or not) is in that one API response.

Ah, so both the database ID and PID are in that same response? Then forget what I said 🙂

pdurbin · 2023-03-17T18:50:01Z

Hi! I just left a review: #355 (review)

jggautier added the bug Report a problem that needs to be fixed label Mar 12, 2023

santisoler mentioned this issue Mar 13, 2023

Support Dataverse files without a persistentID #355

Merged

1 task

dokempf mentioned this issue Mar 14, 2023

DataVerse file DOIs in pooch.retrieve #356

Open

leouieda closed this as completed in #355 Feb 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unable to download files in Dataverse repositories when files don't have PIDs #354

Unable to download files in Dataverse repositories when files don't have PIDs #354

jggautier commented Mar 12, 2023 •

edited

Loading

jggautier commented Mar 12, 2023 •

edited

Loading

santisoler commented Mar 13, 2023

jggautier commented Mar 13, 2023

dokempf commented Mar 13, 2023

santisoler commented Mar 13, 2023

jggautier commented Mar 13, 2023 •

edited

Loading

leouieda commented Mar 14, 2023

dokempf commented Mar 14, 2023

leouieda commented Mar 14, 2023

pdurbin commented Mar 17, 2023

Unable to download files in Dataverse repositories when files don't have PIDs #354

Unable to download files in Dataverse repositories when files don't have PIDs #354

Comments

jggautier commented Mar 12, 2023 • edited Loading

jggautier commented Mar 12, 2023 • edited Loading

santisoler commented Mar 13, 2023

jggautier commented Mar 13, 2023

dokempf commented Mar 13, 2023

santisoler commented Mar 13, 2023

jggautier commented Mar 13, 2023 • edited Loading

leouieda commented Mar 14, 2023

dokempf commented Mar 14, 2023

leouieda commented Mar 14, 2023

pdurbin commented Mar 17, 2023

jggautier commented Mar 12, 2023 •

edited

Loading

jggautier commented Mar 12, 2023 •

edited

Loading

jggautier commented Mar 13, 2023 •

edited

Loading