Introduces automatic subset-level grouping for folder-based dataset builders #7066 #7646
base: main
It adds automatic grouping of files into subsets based on their root name (e.g., […]). Let me know if any changes are needed; happy to iterate!
Hi ! I believe the subsets need to be instantiated here as Lines 647 to 662 in ef762e6
Also the subset names should probably be inferred only from the parquet/csv/json files and not from png/jpeg/wav/mp4 etc. WDYT?
Thanks a lot for the review! You're absolutely right: treating subsets as separate configs instead of overloaded splits makes much more sense. If that approach sounds good to you, I can move the grouping logic to […]. Also totally agree on limiting grouping to structured file types; I'd scope this to […]. Let me know if this direction sounds good, and I'll get started on the changes right away!
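To make the reviewer's suggestion concrete, here is a minimal sketch of restricting subset inference to structured files. The extension set and the helper names are assumptions for illustration only (`.jsonl` is included because the PR description uses it); this is not code from the PR or from the `datasets` library.

```python
from pathlib import PurePosixPath

# Assumption: the structured formats named in the review (parquet/csv/json),
# plus ".jsonl" since the PR description's examples use it.
STRUCTURED_EXTENSIONS = {".parquet", ".csv", ".json", ".jsonl"}


def is_structured(path: str) -> bool:
    """Return True if the file's last extension is a structured data format."""
    return PurePosixPath(path).suffix.lower() in STRUCTURED_EXTENSIONS


def split_structured_and_other(paths):
    """Separate files that may define subsets from media/other files."""
    structured = [p for p in paths if is_structured(p)]
    other = [p for p in paths if not is_structured(p)]
    return structured, other


structured, other = split_structured_and_other(
    ["animals.jsonl", "cat.png", "metadata.csv", "clip.mp4"]
)
print(structured)
print(other)
```

Only the `structured` list would then be passed to the grouping step, so image, audio, and video files never create subset names of their own.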
Fixes #7066
This PR introduces automatic subset-level grouping for folder-based dataset builders by:
- Adding group_files_by_subset(), which clusters files by root name (ignoring digits and shard suffixes).
- Updating FolderBasedBuilder._split_generators() to yield one split per subset.
- Documenting the behavior in docs/source/repository_structure.mdx.

Motivation
Datasets with files like:
will now be automatically grouped as:
- "train" subset → train0.jsonl, train1.jsonl
- "animals" subset → animals.jsonl
- "metadata" subset → metadata.jsonl
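The grouping above can be approximated with a short sketch. This is not the PR's implementation of group_files_by_subset(): the function names, the regular expressions, and the exact rule for what counts as a shard suffix are assumptions chosen to reproduce the example.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath

# Assumed shard suffix shape, e.g. "-00000-of-00010" at the end of the stem.
_SHARD_SUFFIX = re.compile(r"[-_]\d+[-_]of[-_]\d+$")
# Assumed rule: trailing digits, optionally preceded by one separator.
_TRAILING_DIGITS = re.compile(r"[-_.]?\d+$")


def subset_name(path: str) -> str:
    """Derive a subset name from a file path by stripping extensions,
    shard suffixes, and trailing digits."""
    name = PurePosixPath(path).name
    stem = name.split(".", 1)[0]  # drop every extension, e.g. ".csv.gz"
    stem = _SHARD_SUFFIX.sub("", stem)
    stem = _TRAILING_DIGITS.sub("", stem)
    return stem or name  # fall back to the file name if nothing is left


def group_files_by_subset_sketch(paths):
    """Map each inferred subset name to its sorted list of files."""
    groups = defaultdict(list)
    for path in sorted(paths):
        groups[subset_name(path)].append(path)
    return dict(groups)


files = ["train0.jsonl", "train1.jsonl", "animals.jsonl", "metadata.jsonl"]
print(group_files_by_subset_sketch(files))
```

Under these assumptions, a sharded name such as train-00000-of-00010.parquet also maps to "train", because the shard suffix is stripped before the trailing-digit rule runs.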
This enables structured multi-subset loading even when the dataset doesn't follow traditional train/validation/test split conventions.

Files Changed
- src/datasets/data_files.py: added group_files_by_subset() utility
- src/datasets/packaged_modules/folder_based_builder/folder_based_builder.py: grouped files before yielding splits
- tests/test_data_files.py: added unit test test_group_files_by_subset
- docs/source/repository_structure.mdx: documented subset grouping for maintainers and users

Benefits
Ready for review ✅