FEATURE REQUEST: Make the .tsv files that are part of a downloaded dataset available separately #26

Open
KathyReid opened this issue Oct 22, 2023 · 1 comment

Comments

@KathyReid

User story

  • As a researcher, I frequently create data visualisations based on the validated.tsv file of a language / release. Currently the only way to obtain this file is to download the whole dataset or delta.

I want to be able to get just the .tsv files related to a release, without downloading the clips, so that I can produce data visualisations faster.

Acceptance criteria

  • The files

    • clip_durations.tsv
    • invalidated.tsv
    • other.tsv
    • reported.tsv
    • validated.tsv

are available

  • for each language in the CV corpus (about 103 at time of writing)
  • for each version
  • including delta releases

from the CV datasets download page, in the same way as we currently download the .tar.gz formatted datasets.
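
Until the files are offered separately, the .tsv files can at least be pulled out of an archive that has already been downloaded, without unpacking the clips. The following is a minimal sketch, assuming the archive is a gzip-compressed tar file as described above. The function name `extract_tsv_members` and the example paths are hypothetical, and the internal folder layout of the archive is not assumed: any member whose name ends in `.tsv` is written flat into the destination folder. It does not avoid the download itself, which is the point of this request.

```python
import tarfile
from pathlib import Path


def extract_tsv_members(archive_path, dest_dir):
    """Copy only the .tsv members of a .tar.gz archive into dest_dir.

    Hypothetical helper for illustration; returns the sorted file names written.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive:
            # Skip directories, clips, and anything that is not a .tsv file.
            if not (member.isfile() and member.name.endswith(".tsv")):
                continue
            source = archive.extractfile(member)
            if source is None:
                continue
            # Keep only the base name so nothing is written outside dest_dir.
            target = dest / Path(member.name).name
            target.write_bytes(source.read())
            extracted.append(target.name)
    return sorted(extracted)
```

For example, `extract_tsv_members("some-dataset.tar.gz", "tsv_only")` (both names are placeholders) would leave files such as validated.tsv in `tsv_only/` while the clips stay inside the archive. If two members share a base name, the later one overwrites the earlier one.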

@HarikalarKutusu

Thank you for posting this @KathyReid...
I have raised this request many times, on Discourse, in meetings, and in one-to-one talks, ever since I decided to create the CV Metadata Viewer and CV Dataset Analyzer webapps. To keep these apps up to date, I download every dataset every 3 months. The datasets now take up 615 GB on disk and the download takes 2-3 days, which wastes bandwidth and adds an unnecessary carbon footprint. I only work on Turkic languages for training, so more than 100 of the language downloads (114 - 11) are wasted.

Two notes:

  • Default splits should also be included (train.tsv, dev.tsv, test.tsv).
  • Maybe the correct repo for this is common-voice-bundler.
