[CDV] Filenames in underlying storage should be human readable #4041
Comments
This is something that would be useful for us as well. The first of these scenarios is also relevant to compute tasks on data files / large datasets in Dataverse, regardless of the underlying storage (Swift, POSIX, S3, etc.), or more generically whenever Dataverse is sharing storage with another system.
Simplest solution (via @scolapasta and @ferrys) is to include a file in the container that gives the mapping of pretty filenames to actual filenames. Also, to allow for the globbing (e.g. *.csv) use case that I mentioned, we can keep the file extension but replace the file name with the unique identifier. Still to be determined is compatibility with GeoMesa/Accumulo/etc. use cases.
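To make the mapping-file idea concrete, here is a minimal sketch in Python. It is only an illustration, not how Dataverse does or would implement this. The function names, the JSON layout, and the use of a random hex string as the unique identifier are all assumptions. It also assumes pretty filenames are unique within a container.

```python
import fnmatch
import json
import uuid
from pathlib import PurePosixPath


def storage_name(pretty_name: str, identifier: str) -> str:
    """Replace the file name with an identifier but keep the extension."""
    suffix = PurePosixPath(pretty_name).suffix  # ".csv", or "" if there is none
    return f"{identifier}{suffix}"


def build_manifest(pretty_names):
    """Map each pretty filename to a generated storage name.

    Assumption: a random hex string stands in for whatever unique
    identifier the storage layer really uses.
    """
    return {name: storage_name(name, uuid.uuid4().hex) for name in pretty_names}


def manifest_json(manifest) -> str:
    """Serialize the mapping so it can be stored alongside the files."""
    return json.dumps(manifest, indent=2, sort_keys=True)


def glob_pretty(manifest, pattern: str):
    """Pretty names whose *storage* name matches a glob such as '*.csv'."""
    return sorted(
        pretty
        for pretty, stored in manifest.items()
        if fnmatch.fnmatchcase(stored, pattern)
    )
```

Because the extension survives, a glob like `*.csv` still works against the storage names, and anyone who fetches the mapping file first can translate the results back to readable names.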
Just a note that with the new Globus support, filenames are mapped from their weird underlying names on S3 to the human readable names one would expect. This is the PR:
@jeremyfreudberg are you still interested in this? Thanks.
@pdurbin I haven't worked on Dataverse stuff since 2018 (but if you're hiring...). I'm not sure whether the MOC folks are still interested in this. If from your end it makes sense to simply close out this issue, then please do so.
#2909 (comment) affirmed that in Cloud Dataverse a filename in the underlying storage (Swift) would be a "filesystem name", which is unique, but also not human-readable.
The lack of a true rename operation in Swift, worries about uniqueness, and the fact that the Dataverse download API preserves meaningful filenames anyway meant that at the time we were satisfied with the solution of non-pretty names in Swift.
Two specific scenarios where pretty filenames are wanted/needed:
More generally, the relevant scenarios can be summarized as any time someone or some service uses the Swift API to download files. We are currently dreaming up more applications (besides Hadoop/Spark via Sahara) which would prefetch files from the Swift endpoint for the user to play with using compute. In the current state of CDV, the user wouldn't be able to tell what's going on, since they would receive a whole bunch of random files (anything bundled with the dataset, not just raw data) with no way to tell what's what.
Worth noting that these concerns are especially relevant for larger datasets -- direct access through the Swift API instead of the Dataverse API is crucial in that case.
This discussion also ties into a larger discussion about how dataset versioning is reflected on the Swift side of things.