Add columns support to JSON loader for selective key filtering #7652
Fixes #7594
This PR adds support for filtering specific columns when loading datasets from `.json` or `.jsonl` files, similar to how the `columns=...` argument works for Parquet.
As suggested, support for the `columns=...` argument (previously available for Parquet) has now been extended to JSON and JSONL loading via `load_dataset(...)`. You can now load only specific keys/columns and skip the rest, which should help in cases where some fields are unclean, inconsistent, or just unnecessary.

Example:
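A minimal sketch of the intended usage, assuming a build of `datasets` that includes this PR's change. The wrapper name `load_selected_columns`, the file name `data.jsonl`, and the column names are hypothetical and only serve as illustration:

```python
from typing import List


def load_selected_columns(path: str, columns: List[str]):
    """Load only the given keys from a .json/.jsonl file (hypothetical helper)."""
    # Imported lazily so the snippet can be read without `datasets` installed.
    from datasets import load_dataset

    # `columns=` on the "json" builder is the argument this PR proposes;
    # it is assumed here to behave like the existing Parquet option.
    return load_dataset("json", data_files=path, split="train", columns=columns)


# Hypothetical call, assuming data.jsonl holds records with "id" and "title" keys:
# ds = load_selected_columns("data.jsonl", ["id", "title"])
```

Any keys not listed in `columns` would be skipped at load time, so messy fields never need to be parsed into the resulting dataset.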
Summary of changes:

- Added `columns: Optional[List[str]]` to `JsonConfig`
- Updated `_generate_tables()` to filter selected columns
- Passed the `columns` argument from `load_dataset()` to the config

Let me know if you'd like the same to be added for CSV or others as a follow-up; happy to help.
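To illustrate the filtering step listed in the summary, here is a standard-library sketch of selecting keys from JSONL records. It is not the code in `_generate_tables()`; the function name `iter_filtered_records` is hypothetical, and filling a missing key with `None` is an assumption rather than the PR's confirmed behavior:

```python
import io
import json
from typing import Any, Dict, Iterable, Iterator, List, Optional


def iter_filtered_records(
    lines: Iterable[str], columns: Optional[List[str]] = None
) -> Iterator[Dict[str, Any]]:
    """Yield one dict per JSONL line, keeping only the requested keys."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if columns is None:
            yield record  # no filter requested: keep every key
        else:
            # Assumption: a requested key that is absent becomes None.
            yield {key: record.get(key) for key in columns}


# Two sample records; the second one lacks "title".
sample = io.StringIO(
    '{"id": 1, "title": "a", "noise": [1, 2]}\n'
    '{"id": 2, "noise": "x"}\n'
)
rows = list(iter_filtered_records(sample, columns=["id", "title"]))
print(rows)
```

The inconsistent `noise` field (a list in one record, a string in the other) never reaches the output, which is the kind of case the `columns` argument is meant to help with.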