Add columns support to JSON loader for selective key filtering #7652

Open
wants to merge 3 commits into
base: main

Conversation

ArjunJagdale
Contributor

@ArjunJagdale ArjunJagdale commented Jun 27, 2025

Fixes #7594
This PR adds support for loading only selected columns from .json or .jsonl files, similar to how the columns=... argument works for Parquet.

As suggested, load_dataset(...) now accepts the columns=... argument (previously available for Parquet) for JSON and JSONL files. You can load only specific keys/columns and skip the rest, which helps when some fields are unclean, inconsistent, or simply unnecessary.

Example:

from datasets import load_dataset

dataset = load_dataset("json", data_files="your_data.jsonl", columns=["id", "title"])
print(dataset["train"].column_names)
# Output: ['id', 'title']

Summary of changes:

  • Added columns: Optional[List[str]] to JsonConfig
  • Updated _generate_tables() to filter selected columns
  • Forwarded columns argument from load_dataset() to the config
  • Added test case to validate behavior
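
For readers who want a feel for the filtering step, here is a minimal sketch of the idea using only the standard library. It is not the code in this PR's _generate_tables(). The helper name filter_jsonl_columns is hypothetical, and filling missing keys with None is an assumption made for illustration rather than confirmed behavior of the loader.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List, Optional


def filter_jsonl_columns(
    lines: Iterable[str],
    columns: Optional[List[str]] = None,
) -> Iterator[Dict[str, Any]]:
    """Yield one dict per JSONL record, restricted to `columns` if given.

    Hypothetical helper for illustration; not part of the datasets library.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        if columns is None:
            yield record  # no filtering requested, keep every key
        else:
            # Assumption: a key missing from a record is filled with None.
            yield {name: record.get(name) for name in columns}


sample = [
    '{"id": 1, "title": "a", "noisy": {"x": 1}}',
    '{"id": 2, "title": "b", "noisy": "inconsistent"}',
]
rows = list(filter_jsonl_columns(sample, columns=["id", "title"]))
print(rows)
# [{'id': 1, 'title': 'a'}, {'id': 2, 'title': 'b'}]
```

Note that the "noisy" field has a different type in each record, which is the kind of inconsistency that dropping a column at load time is meant to sidestep.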

Let me know if you'd like the same added for CSV or other formats as a follow-up. Happy to help.

@ArjunJagdale ArjunJagdale changed the title temp1 Add columns parameter to JSON loader to filter selected columns during loading Jun 27, 2025
@ArjunJagdale ArjunJagdale changed the title Add columns parameter to JSON loader to filter selected columns during loading Add columns support to JSON loader for selective key filtering Jun 27, 2025
Development

Successfully merging this pull request may close these issues.

Add option to ignore keys/columns when loading a dataset from jsonl(or any other data format)
1 participant