
Introduces automatic subset-level grouping for folder-based dataset builders #7066 #7646


Open · 4 commits to be merged into base: main

ArjunJagdale
Contributor

@ArjunJagdale ArjunJagdale commented Jun 26, 2025

Fixes #7066
This PR introduces automatic subset-level grouping for folder-based dataset builders by:

  1. Adding a utility function group_files_by_subset() that clusters files by root name (ignoring digits and shard suffixes).
  2. Integrating this logic into FolderBasedBuilder._split_generators() to yield one split per subset.
  3. Adding unit tests for the grouping function.
  4. Updating the documentation to describe this new behavior under docs/source/repository_structure.mdx.

Motivation

Datasets with files like:


train0.jsonl
train1.jsonl
animals.jsonl
metadata.jsonl

will now be automatically grouped as:

  • "train" subset → train0.jsonl, train1.jsonl
  • "animals" subset → animals.jsonl
  • "metadata" subset → metadata.jsonl

This enables structured multi-subset loading even when the dataset doesn't follow traditional train/validation/test split conventions.
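The grouping described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the two regular expressions and the fallback for all-digit names are assumptions about what "ignoring digits and shard suffixes" means, and only the function name group_files_by_subset comes from the description.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath

# Assumed patterns (not taken from the PR):
# a shard suffix such as "-00000-of-00005", and a trailing run of digits.
_SHARD_SUFFIX = re.compile(r"[-_.]?\d+[-_]of[-_]\d+$")
_TRAILING_DIGITS = re.compile(r"[-_.]?\d+$")


def group_files_by_subset(paths):
    """Cluster file paths by root name, ignoring shard suffixes and trailing digits."""
    groups = defaultdict(list)
    for path in paths:
        # Drop the directory and every extension: "data/train0.jsonl" -> "train0"
        stem = PurePosixPath(path).name.split(".", 1)[0]
        root = _SHARD_SUFFIX.sub("", stem)
        root = _TRAILING_DIGITS.sub("", root)
        # If the name was all digits, fall back to the original stem
        groups[root or stem].append(path)
    return {name: sorted(files) for name, files in sorted(groups.items())}


if __name__ == "__main__":
    files = ["train0.jsonl", "train1.jsonl", "animals.jsonl", "metadata.jsonl"]
    for subset, members in group_files_by_subset(files).items():
        print(subset, members)
```

With these assumed patterns, a sharded name such as train-00000-of-00005.parquet would also land in the "train" group.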


Files Changed

  • src/datasets/data_files.py: added group_files_by_subset() utility
  • src/datasets/packaged_modules/folder_based_builder/folder_based_builder.py: grouped files before yielding splits
  • tests/test_data_files.py: added unit test test_group_files_by_subset
  • docs/source/repository_structure.mdx: documented subset grouping for maintainers and users

Benefits

  • More flexible and robust dataset split logic
  • Enables logical grouping of user-uploaded files without nested folder structure
  • Backward-compatible with all existing folder-based configs

Ready for review ✅

@ArjunJagdale
Contributor Author

It adds automatic grouping of files into subsets based on their root name (e.g., train0.jsonl, train1.jsonl → "train"), as discussed above. The logic is integrated into FolderBasedBuilder and is fully tested and documented.

Let me know if any changes are needed — happy to iterate!

@lhoestq
Member

lhoestq commented Jun 26, 2025

Hi ! I believe the subsets need to be instantiated here as configs - not splits (which are meant for train/validation/test):

if metadata_configs:
    builder_configs, default_config_name = create_builder_configs_from_metadata_configs(
        module_path,
        metadata_configs,
        base_path=base_path,
        default_builder_kwargs=default_builder_kwargs,
        download_config=self.download_config,
    )
else:
    builder_configs: list[BuilderConfig] = [
        import_main_class(module_path).BUILDER_CONFIG_CLASS(
            data_files=data_files,
            **default_builder_kwargs,
        )
    ]
    default_config_name = None

Also the subset names should probably be inferred only from the parquet/csv/json files and not from png/jpeg/wav/mp4 etc. WDYT ?

@ArjunJagdale
Contributor Author

ArjunJagdale commented Jun 26, 2025

Thanks a lot for the review!

You're absolutely right — treating subsets as separate configs instead of overloaded splits makes much more sense. If that approach sounds good to you, I can move the grouping logic to load.py, where configs are instantiated, and revise the PR to emit one BuilderConfig per grouped subset.

Also totally agree on limiting grouping to structured file types — I’d scope this to .json, .jsonl, .csv, and .parquet.
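As a rough sketch of the direction proposed here (one config per subset, inferred only from structured files), something like the following could work. It is illustrative only: SubsetConfig is a hypothetical stand-in for a real BuilderConfig, configs_from_groups is a made-up helper name, and nothing below is the actual code in load.py.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath

# Extensions named in this thread as eligible for subset inference
STRUCTURED_EXTENSIONS = {".json", ".jsonl", ".csv", ".parquet"}


@dataclass
class SubsetConfig:
    """Hypothetical stand-in for a BuilderConfig: a config name plus its data files."""
    name: str
    data_files: list = field(default_factory=list)


def configs_from_groups(groups):
    """Build one config per subset, skipping files that are not structured data."""
    configs = []
    for name, files in sorted(groups.items()):
        structured = [
            f for f in files
            if PurePosixPath(f).suffix.lower() in STRUCTURED_EXTENSIONS
        ]
        # A group made up only of media files (png, wav, ...) yields no config
        if structured:
            configs.append(SubsetConfig(name=name, data_files=structured))
    return configs


if __name__ == "__main__":
    groups = {
        "train": ["train0.jsonl", "train1.jsonl"],
        "cat": ["cat.png"],
        "animals": ["animals.csv"],
    }
    for config in configs_from_groups(groups):
        print(config.name, config.data_files)
```

Each resulting config would then keep the usual train/validation/test splits internally, which keeps subsets and splits as separate concepts.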

Let me know if this direction sounds good, and I’ll get started on the changes right away!

@ArjunJagdale ArjunJagdale changed the title from "Update data_files.py #7066" to "Introduces automatic subset-level grouping for folder-based dataset builders #7066" on Jun 27, 2025
Development

Successfully merging this pull request may close these issues.

One subset per file in repo ?