Enable parallel shard upload in push_to_hub() using num_proc #7649

Open · wants to merge 1 commit into main

Conversation

@ArjunJagdale (Contributor) commented on Jun 27, 2025

Fixes #7591

Add num_proc support to push_to_hub() for parallel shard upload

This PR adds support for parallel upload of dataset shards via the num_proc argument in Dataset.push_to_hub().

📌 While the num_proc parameter was already present in the push_to_hub() signature and correctly passed to _push_parquet_shards_to_hub(), it was not being used to parallelize the upload.

🔧 This PR updates the internal _push_parquet_shards_to_hub() function to:

  • Use multiprocessing.Pool and iflatmap_unordered() for concurrent shard upload when num_proc > 1
  • Preserve original serial upload behavior if num_proc is None or ≤ 1
  • Keep tqdm progress and commit behavior unchanged

Let me know if any test coverage or further changes are needed!
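
For illustration, here is a minimal sketch of the branching logic described above, assuming the `iflatmap_unordered(pool, func, kwargs_iterable=...)` helper from `datasets.utils.py_utils`. The `upload_shard` and `push_shards` functions are hypothetical stand-ins for the real per-shard serialization/upload code inside `_push_parquet_shards_to_hub()`, not the actual implementation:

```python
import multiprocessing
from typing import List, Optional

from tqdm import tqdm
from datasets.utils.py_utils import iflatmap_unordered


def upload_shard(shard_index: int) -> List[str]:
    # Hypothetical worker: convert shard `shard_index` to parquet, upload it,
    # and return metadata for the final commit (simplified to a string here).
    return [f"shard-{shard_index:05d} uploaded"]


def push_shards(num_shards: int, num_proc: Optional[int] = None) -> List[str]:
    uploaded: List[str] = []
    if num_proc is not None and num_proc > 1:
        # Parallel path: each worker handles one shard; results arrive unordered.
        kwargs_iterable = [{"shard_index": i} for i in range(num_shards)]
        with multiprocessing.Pool(num_proc) as pool:
            for item in tqdm(
                iflatmap_unordered(pool, upload_shard, kwargs_iterable=kwargs_iterable),
                total=num_shards,
                desc="Uploading shards",
            ):
                uploaded.append(item)
    else:
        # Serial path: unchanged behavior when num_proc is None or <= 1.
        for i in tqdm(range(num_shards), desc="Uploading shards"):
            uploaded.extend(upload_shard(i))
    return uploaded


if __name__ == "__main__":
    print(push_shards(num_shards=8, num_proc=4))
```

With the PR applied, callers would opt in with something like `ds.push_to_hub("username/my-dataset", num_proc=4)` (repository id is illustrative); passing `num_proc=None` or `1` keeps the current serial upload.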

@ArjunJagdale changed the title from "Add num_proc support to push_to_hub() for parallel shard upload" to "Enable parallel shard upload in push_to_hub() using num_proc" on Jun 27, 2025
Successfully merging this pull request may close: Add num_proc parameter to push_to_hub (#7591)