Introduces automatic subset-level grouping for folder-based dataset builders #7066 #7646

ArjunJagdale · 2025-06-26T07:01:37Z

Fixes #7066
This PR introduces automatic subset-level grouping for folder-based dataset builders by:

Adding a utility function group_files_by_subset() that clusters files by root name (ignoring digits and shard suffixes).
Integrating this logic into FolderBasedBuilder._split_generators() to yield one split per subset.
Adding unit tests for the grouping function.
Updating the documentation to describe this new behavior under docs/source/repository_structure.mdx.

Motivation

Datasets with files like:


train0.jsonl
train1.jsonl
animals.jsonl
metadata.jsonl

will now be automatically grouped as:

"train" subset → train0.jsonl, train1.jsonl
"animals" subset → animals.jsonl
"metadata" subset → metadata.jsonl

This enables structured multi-subset loading even when the dataset doesn't follow traditional train/validation/test split conventions.
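The grouping described above can be sketched as follows. This is a minimal illustration, not the code from the PR: the function body, the regular expression, and the fallback for all-digit names are assumptions, chosen only so that the example reproduces the grouping listed in the Motivation section.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath

# Assumed pattern: strip either a shard suffix such as "-00000-of-00005"
# or a plain trailing number such as "0" / "_12" from the file stem.
_SUFFIX = re.compile(r"([-_.]\d+-of-\d+|[-_.]?\d+)$")


def group_files_by_subset(files):
    """Group file paths by their stem with trailing digits/shard suffix removed."""
    groups = defaultdict(list)
    for path in files:
        stem = PurePosixPath(path).stem      # "train0.jsonl" -> "train0"
        root = _SUFFIX.sub("", stem)         # "train0" -> "train"
        groups[root or stem].append(path)    # keep all-digit stems unchanged
    return dict(groups)


files = ["train0.jsonl", "train1.jsonl", "animals.jsonl", "metadata.jsonl"]
groups = group_files_by_subset(files)
# groups == {
#     "train": ["train0.jsonl", "train1.jsonl"],
#     "animals": ["animals.jsonl"],
#     "metadata": ["metadata.jsonl"],
# }
```

Keying on the stem rather than the full path means files in different directories with the same root name would land in the same group; whether that is desirable is a design decision this sketch does not settle.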

Files Changed

src/datasets/data_files.py: added group_files_by_subset() utility
src/datasets/packaged_modules/folder_based_builder/folder_based_builder.py: grouped files before yielding splits
tests/test_data_files.py: added unit test test_group_files_by_subset
docs/source/repository_structure.mdx: documented subset grouping for maintainers and users

Benefits

More flexible and robust dataset split logic
Enables logical grouping of user-uploaded files without nested folder structure
Backward-compatible with all existing folder-based configs

Ready for review ✅

ArjunJagdale · 2025-06-26T08:26:14Z

This PR adds automatic grouping of files into subsets based on their root name (e.g., train0.jsonl, train1.jsonl → "train"), as described above. The logic is integrated into FolderBasedBuilder and is covered by unit tests and documentation.

Let me know if any changes are needed — happy to iterate!

lhoestq · 2025-06-26T14:02:38Z

Hi ! I believe the subsets need to be instantiated here as configs - not splits (which are meant for train/validation/test):

datasets/src/datasets/load.py

Lines 647 to 662 in ef762e6

    
    if metadata_configs:
        builder_configs, default_config_name = create_builder_configs_from_metadata_configs(
            module_path,
            metadata_configs,
            base_path=base_path,
            default_builder_kwargs=default_builder_kwargs,
            download_config=self.download_config,
        )
    else:
        builder_configs: list[BuilderConfig] = [
            import_main_class(module_path).BUILDER_CONFIG_CLASS(
                data_files=data_files,
                **default_builder_kwargs,
            )
        ]
        default_config_name = None

Also the subset names should probably be inferred only from the parquet/csv/json files and not from png/jpeg/wav/mp4 etc. WDYT ?
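One way to restrict subset-name inference to tabular files is a simple extension filter applied before any grouping. The sketch below is only an illustration of that suggestion: the helper name, the constant, and the inclusion of `.jsonl` alongside the parquet/csv/json types named in the comment are assumptions.

```python
from pathlib import PurePosixPath

# Assumed set of structured extensions; media files (png, jpeg, wav, mp4, ...)
# are deliberately absent so they never contribute a subset name.
TABULAR_EXTENSIONS = {".parquet", ".csv", ".json", ".jsonl"}


def is_subset_candidate(path):
    """Return True if the file type should be used to infer subset names."""
    return PurePosixPath(path).suffix.lower() in TABULAR_EXTENSIONS


files = ["train0.jsonl", "cat.png", "clip.wav", "animals.csv", "video.mp4"]
candidates = [f for f in files if is_subset_candidate(f)]
# candidates == ["train0.jsonl", "animals.csv"]
```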

ArjunJagdale · 2025-06-26T17:22:25Z

Thanks a lot for the review!

You're absolutely right: treating subsets as separate configs rather than overloading splits makes much more sense. I can move the grouping logic to load.py, where configs are instantiated, and revise the PR to emit one BuilderConfig per grouped subset.

Also totally agree on limiting grouping to structured file types — I’d scope this to .json, .jsonl, .csv, and .parquet.

Let me know if this direction sounds good, and I’ll get started on the changes right away!
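The proposed direction, one config per grouped subset, could look roughly like the sketch below. It uses plain dictionaries rather than the library's BuilderConfig class, and the function name and the `config_name` / `data_files` keys are illustrative assumptions, not the revised PR code.

```python
def configs_from_groups(groups):
    """Turn {subset_name: [files]} into one config entry per subset."""
    return [
        # Sorting keeps the output deterministic regardless of input order.
        {"config_name": name, "data_files": sorted(paths)}
        for name, paths in sorted(groups.items())
    ]


groups = {
    "train": ["train1.jsonl", "train0.jsonl"],
    "animals": ["animals.jsonl"],
}
configs = configs_from_groups(groups)
# configs == [
#     {"config_name": "animals", "data_files": ["animals.jsonl"]},
#     {"config_name": "train", "data_files": ["train0.jsonl", "train1.jsonl"]},
# ]
```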

ArjunJagdale added 4 commits June 26, 2025 12:26

Update data_files.py huggingface#7066

5f10a79

Update folder_based_builder.py

a1b773e

Update test_data_files.py

8f5cd35

Update repository_structure.mdx

aa00985

ArjunJagdale changed the title ~~Update data_files.py #7066~~ Introduces automatic subset-level grouping for folder-based dataset builders #7066 Jun 27, 2025
