Tabular Ingest - identical file names with different extension, e.g. .xlsx and .csv #6991

Closed
mheppler opened this issue Jun 17, 2020 · 8 comments
Labels
Feature: File Upload & Handling; Status: Needs Reproducing (someone should try to reproduce the issue to make sure it's still valid); Type: Bug (a defect); User Role: Depositor (creates datasets, uploads data, etc.)

Comments

@mheppler
Contributor

@philippconzett Thank you for the suggestion in our community meeting chat. I took the liberty of creating a new issue here in GitHub. If you could, please provide more details about your use case here. We briefly discussed this as a team and think this is something we can improve upon.

From Philipp Conzett

Great! One more nice feature would be accepting identical files with different file extensions, e.g. .xlsx and .csv.

@landreev changed the title from "File Ingest - identical files with different file extension, e.g. .xlsx and .csv" to "Tabular Ingest - identical file names with different extension, e.g. .xlsx and .csv" on Jun 17, 2020
@pdurbin
Member

pdurbin commented Jun 17, 2020

@philippconzett do you think we could reproduce this issue with the files you provided at IQSS/dataverse-sample-data@f3ef7ee?

You made a nice detailed commit message which I'll copy and paste below:

  • two identical tabular data files, one Excel file (.xlsx), and one tab-separated plain text file (.txt)
  • json metadata file of the dataset

We'd like our depositors to be able to upload tabular data in the original file format (e.g. Excel; .xlsx) and in a preferred file format (tab-separated plain text; .txt). These files should have the same file names except for the file extension. Currently, such files are handled like this in Dataverse:

  • The Excel file is ingested.
  • Dataverse recognizes identical content in the Excel file and the .txt file. Therefore, a "1" is added to the file name of the .txt file.

For more information, see this discussion in the Dataverse Google Group: https://groups.google.com/forum/?hl=no#!topic/dataverse-community/_2Tm2B2sQhc

@landreev
Contributor

landreev commented Jun 17, 2020

We should have double-checked what Dataverse v5 (release candidate) is currently doing before opening the issue, because there is a chance this is already being handled the way we want.
Also, just a reminder for everybody to be careful with the terminology. How "identical files" should be interpreted has been a source of confusion: are we talking about files with identical content, or files with the same file name?
We have already made the changes necessary to accept files with identical content. And files with the same names but different extensions are not a problem by themselves.
But we did realize, in an internal discussion, that there was one somewhat special case we weren't sure about: "ingestable" files whose names have different extensions as uploaded (such as "data.csv" and "data.xlsx", as in the example above) but that will become the same ("data.tab") if and when tabular ingest succeeds.
So this is what this issue is about - to investigate how this case is being handled now; and address the behavior if needed.
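
For anyone trying to reproduce this, the special case can be sketched in a few lines of Python. This is only an illustration and not Dataverse's actual ingest code: the set of ingestable extensions and the function names `post_ingest_name` and `find_collisions` are assumptions made up for this sketch.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Assumption: an illustrative subset of extensions that tabular ingest
# would convert to ".tab". The real list lives in Dataverse itself.
INGESTABLE_EXTENSIONS = {".csv", ".xlsx"}


def post_ingest_name(filename: str) -> str:
    """Return the name a file would carry if tabular ingest succeeded."""
    path = PurePosixPath(filename)
    if path.suffix.lower() in INGESTABLE_EXTENSIONS:
        return path.with_suffix(".tab").name
    return filename


def find_collisions(filenames):
    """Group uploaded names by post-ingest name; keep groups with more than one file."""
    groups = defaultdict(list)
    for name in filenames:
        groups[post_ingest_name(name)].append(name)
    return {final: originals for final, originals in groups.items() if len(originals) > 1}
```

Calling `find_collisions(["data.csv", "data.xlsx"])` reports that both uploads map to "data.tab", which is exactly the situation this issue asks us to investigate.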

@landreev
Contributor

@pdurbin - Thanks for adding that extra text btw; it clarifies and disambiguates a lot. (I typed my comment above before seeing it). So yes, this is definitely about the special case of what happens when files (and filenames) get modified after the fact, by the tabular ingest process.

When trying to reproduce, let's remember to test against the current develop/v5 draft branch, and not the currently released v4.20, because the functionality that deals with "duplicates" and such has been modified since then.

@mheppler
Contributor Author

mheppler commented Jun 17, 2020

The recent improvements to file handling mentioned above were merged with the pull request "6574 filenames" (#6893) and will be included in our next release, 5.0.

There is also "4813 allow duplicate files" (#6924), which will soon follow onto the 5.0 train.

@philippconzett
Contributor

Thanks for creating this issue, and sorry for my late reply. I now see that you already have figured out that most parts of this issue are fixed in V5. I just would like to repeat once more the reason for the request:

In DataverseNO, we require depositors to provide tabular data as tab-separated plain text files with the extension .txt (which is the default in Excel with Norwegian settings). If they want, they can also provide the same content in the original file format, e.g. .xlsx or .ods. We also require the .txt file and the file in the original file format to have the same file name (except for the file extension), because this makes it much easier for me to check once a year whether all files in DataverseNO are in a preferred file format (this check is part of our Preservation Plan).

Ideally, we would also like the tab-separated .txt file to be properly ingested, but I guess this is related to another issue.
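
To show why matching file names make the yearly check easy, here is a rough Python sketch of such a check. It is not the script DataverseNO actually uses: the extension sets and the function name `missing_preferred` are assumptions chosen for illustration, based only on the formats mentioned above.

```python
from pathlib import PurePosixPath

# Assumptions for this sketch: ".txt" is the preferred format, and
# ".xlsx" / ".ods" stand in for "original" formats that need a companion.
PREFERRED_EXTENSIONS = {".txt"}
ORIGINAL_EXTENSIONS = {".xlsx", ".ods"}


def missing_preferred(filenames):
    """Return original-format files that have no same-named preferred-format file."""
    paths = [PurePosixPath(name) for name in filenames]
    # Base names (without extension) that already exist in a preferred format.
    covered = {p.stem for p in paths if p.suffix.lower() in PREFERRED_EXTENSIONS}
    return sorted(
        p.name
        for p in paths
        if p.suffix.lower() in ORIGINAL_EXTENSIONS and p.stem not in covered
    )
```

With the naming rule in place, the check reduces to comparing base names. If Dataverse appends a "1" to one of the two names, the pair no longer matches and the file is flagged even though a preferred-format copy exists.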

@pdurbin
Member

pdurbin commented Jun 26, 2020

@philippconzett I'd feel remiss if I didn't mention that @donsizemore and I made and merged IQSS/dataverse-sample-data#20 yesterday because we were having trouble with the files you added. The files are still there and can be tested, but what I think we were observing was that the text file couldn't be added because it was a duplicate of a file that had been ingested. That's my theory anyway. And duplicate file handling will be more permissive once #6924 gets merged.

@pdurbin
Member

pdurbin commented Oct 7, 2022

duplicate file handling will be more permissive once #6924 gets merged

This was merged so it's probably time to re-test. @philippconzett do you want to try?

@pdurbin added the labels Type: Bug, User Role: Depositor, and Status: Needs Reproducing on Oct 8, 2023
@cmbz

cmbz commented Aug 20, 2024

To focus on the most important features and bugs, we are closing issues created before 2020 (version 5.0) that are not new feature requests with the label 'Type: Feature'.

If you created this issue and you feel the team should revisit this decision, please reopen the issue and leave a comment.

@cmbz closed this as completed on Aug 20, 2024