Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unable to download files in Dataverse repositories when files don't have PIDs #354

Closed
jggautier opened this issue Mar 12, 2023 · 10 comments · Fixed by #355
Closed

Unable to download files in Dataverse repositories when files don't have PIDs #354

jggautier opened this issue Mar 12, 2023 · 10 comments · Fixed by #355
Labels
bug Report a problem that needs to be fixed

Comments

@jggautier
Copy link

jggautier commented Mar 12, 2023

Description of the problem:

Hi! I was happy to hear that pooch has support for downloading files in repositories that use Dataverse. I played around with the commands a bit today.

The pooch.create and DOIDownloader functions work great when the files I want to download in Dataverse repositories have persistent identifiers. It looks like both functions assume that files in Dataverse repositories will have PIDs.

But many repositories using Dataverse don't register PIDs for their files (see a conversation in the Dataverse Google Group about which repositories do and don't). We can't assume that all files in Dataverse repositories will have PIDs, and it's likely that more repositories using Dataverse will "turn off" PID registration for their files.

So when I try to download a file that doesn't have a PID, I get an error.

Full code that generated the error

Using pooch.create:

data = pooch.create(
	base_url='doi:10.7910/DVN/DCDKZQ',
	path=pooch.os_cache('testing'))
data.load_registry_from_doi()
datafile = data.fetch(
	fname='Indata_2022.10.02_19.44.51.zip')

Using DOIDownloader:

downloader = DOIDownloader()
url = 'doi:10.7910/DVN/DCDKZQ/Indata_2022.10.02_19.44.51.zip'
downloader(
	url=url, 
	output_file='/testing/Indata_2022.10.02_19.44.51.zip',
	pooch=None)

Full error message

Traceback (most recent call last):
  File "/testing_pooch.py", line 28, in <module>
    downloader(
  File "/usr/local/lib/python3.9/site-packages/pooch/downloaders.py", line 611, in __call__
    downloader(download_url, output_file, pooch)
  File "/usr/local/lib/python3.9/site-packages/pooch/downloaders.py", line 207, in __call__
    response.raise_for_status()
  File "/usr/local/lib/python3.9/site-packages/requests/models.py", line 943, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=
[Finished in 2.9s with exit code 1]

System information

  • Operating system: Mac OS 13.2.1
  • Python installation (Anaconda, system, ETS): System (I think)
  • Version of Python: 3.9.12
  • Version of this package: v1.7.0
@jggautier jggautier added the bug Report a problem that needs to be fixed label Mar 12, 2023
@jggautier
Copy link
Author

jggautier commented Mar 12, 2023

While not all files in Dataverse repositories have PIDs, they always have database IDs, so the Dataverse API endpoint for accessing files can use the file's database ID instead.

I think the download_url function at

def download_url(self, file_name):
tries to get a file's PID to create a download url for the file. And when the file doesn't have a PID, it creates a URL, e.g. https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=, that returns that 404 HTTPError.

The same API endpoint can also use the file's database ID, e.g. https://dataverse.harvard.edu/file.xhtml?fileId=6570505, where 6570505 is the database ID of the file "Indata_2022.10.02_19.44.51.zip".

So could that download_url always look for the file's database ID, instead, and use that to create the download_url?

@santisoler
Copy link
Member

Hi @jggautier! Thanks for opening this issue.

I'll open a PR to fix this bug. I see your point and I do think we should add support for files that don't have a persistentID.

I'm going to download the file using the ID only if the persistentID is empty. This way we will still be using the persistentID for any file that provides it, and fallback to the file ID for these corner cases.

cc @dokempf

@jggautier
Copy link
Author

That sounds great, thanks!

Just in case it's helpful, I'd like to clarify that I don't think it's a corner case that files in Dataverse repositories won't have persistentIDs. That is, when I checked in October, 49 of 70 repositories had files with no persistentIDs. And the Dataverse community has been talking about turning off file PID registration (it's on by default, and many repositories have had to turn it off due to cost and performance issue), so I think it's likely that more files will be published without file PIDs than with them.

@dokempf
Copy link
Contributor

dokempf commented Mar 13, 2023

Thanks for the report @jggautier - I was implementing this based on my own DataVerse experience, which seems to be with an instance that has PIDs on files. I was not aware of the fact.

@santisoler I am also available for this if your too constrained - just ask. Do you think having the test dataset on an instance without PIDs would be a valueable addition to the testsuite? Maybe @jggautier could provide access to one.

@santisoler
Copy link
Member

Thanks @dokempf for jumping in. I've already open an PR to fix this (#355), but it still needs some tests. Feel free to take over that PR.

Do you think having the test dataset on an instance without PIDs would be a valueable addition to the testsuite?

For sure, that would be excellent.

I'd like to clarify that I don't think it's a corner case that files in Dataverse repositories won't have persistentIDs.

Sorry, I misunderstood it. Thanks for the clarification.

I see that probably persistentIDs will become obsolete in the near future. But since the documentation of the Dataverse API offers the persistentID as the first option and mentions the id as an alternative, I would be inclined to keep the former as the first approach and fallback to the latter if the persistent id is empty (the changes in #355 reflect this). I think it's a conservative decision that lowers the chance of breaking backward compatibility, doesn't introduce a performance hit (it just needs to check if an string is empty) and actually solves this bug.

Let me know if there's any detail I'm not seeing. I'm actually quite new to Dataverse. Actually, I found out about it through @dokempf contributions to Pooch (so big thanks for that).

@jggautier
Copy link
Author

jggautier commented Mar 13, 2023

Your point about what the API documentation implies makes sense to me. Thanks so much for the insight. I'd guess that documentation was written when the ability to register persistentIDs for files was added to Dataverse, and it was assumed that most files would be getting persistentIDs. It wasn't anticipated that repositories would need to turn off file persistentIDs. @pdurbin who's also at Dataverse and works on Dataverse APIs a lot would know more. :)

It's possible that in the future, different types of users, like depositors and curators, will have more control over which of their datasets in a repository do and don't get file persistentIDs. (And the documentation will be updated.)

You're approach sounds fine to me, although I'm no developer! :)

Thanks again!

@leouieda
Copy link
Member

The only downside I can see to @santisoler's proposal is that for a majority of repositories we'll be hitting the API twice to get an ID for download. That will have a performance penalty since it will depend if the network and server speeds. Probably nothing major unless someone is downloading thousands of times.

@dokempf
Copy link
Contributor

dokempf commented Mar 14, 2023

The only downside I can see to @santisoler's proposal is that for a majority of repositories we'll be hitting the API twice to get an ID for download. That will have a performance penalty since it will depend if the network and server speeds. Probably nothing major unless someone is downloading thousands of times.

We are actually hitting the API exactly once and cache the result of that. All the file information (PID or not) is in that one API response.

@leouieda
Copy link
Member

We are actually hitting the API exactly once and cache the result of that. All the file information (PID or not) is in that one API response.

Ah, so both the database ID and PID are in that same response? Then forget what I said 🙂

@pdurbin
Copy link
Contributor

pdurbin commented Mar 17, 2023

Hi! I just left a review: #355 (review)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Report a problem that needs to be fixed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants