feat: Add h5folder dataset loader for HDF5 support #7625

Open
wants to merge 3 commits into
base: main

Conversation

ArjunJagdale
Contributor

Related Issue

Closes #3113

What does this PR do?

This PR introduces a new dataset loader module called h5folder to support loading datasets stored in HDF5 (.h5) format.

It allows users to do:

from datasets import load_dataset
dataset = load_dataset("h5folder", data_dir="path/to/")

🧩 Design Overview

  • Implemented inside datasets/packaged_modules/h5folder/h5folder.py
  • Based on the GeneratorBasedBuilder API
  • Uses h5py to read HDF5 files and yield examples
  • Expects datasets such as id, data, and label inside data.h5
  • Converts numpy arrays to Python types before yielding
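The read-and-yield flow described in the list above might look roughly like the sketch below. This is not the code from the PR: the helper names (to_python, iter_rows, iter_h5_examples) are hypothetical, the dataset names id, data, and label are taken from the description, and the sketch assumes every dataset shares the same first dimension. In an actual GeneratorBasedBuilder, logic like this would presumably live in _generate_examples.

```python
from typing import Any, Dict, Iterator, Mapping, Sequence, Tuple


def to_python(value: Any) -> Any:
    """Convert NumPy scalars/arrays to plain Python objects; pass others through."""
    # NumPy scalars and arrays both expose .tolist(); plain Python values do not.
    return value.tolist() if hasattr(value, "tolist") else value


def iter_rows(columns: Mapping[str, Sequence[Any]]) -> Iterator[Tuple[int, Dict[str, Any]]]:
    """Yield (key, example) pairs from equally long columns (hypothetical helper)."""
    lengths = {name: len(col) for name, col in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Columns differ in length: {lengths}")
    n_rows = next(iter(lengths.values()), 0)
    for i in range(n_rows):
        # One example per row index, with each field converted to Python types.
        yield i, {name: to_python(col[i]) for name, col in columns.items()}


def iter_h5_examples(
    path: str, names: Sequence[str] = ("id", "data", "label")
) -> Iterator[Tuple[int, Dict[str, Any]]]:
    """Read the named datasets from an HDF5 file and yield examples (sketch only)."""
    import h5py  # imported lazily so the helpers above work without h5py installed

    with h5py.File(path, "r") as f:
        # f[name][...] reads the whole dataset into memory as a NumPy array.
        columns = {name: f[name][...] for name in names}
    yield from iter_rows(columns)
```

Reading each dataset fully into memory keeps the sketch short; for large files, slicing the datasets in chunks inside the with block would be the more careful choice.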

🧪 Example .h5 Structure (for local testing)

import h5py
import numpy as np

with h5py.File("data.h5", "w") as f:
    f.create_dataset("id", data=np.arange(100))
    f.create_dataset("data", data=np.random.randn(100, 10))
    f.create_dataset("label", data=np.random.randint(0, 2, size=100))

✅ Testing

  • The loader logic follows the structure of existing modules like imagefolder
  • Will rely on Hugging Face CI to validate integration
  • Manual testing is planned during review feedback or once merged

📁 Files Added

  • datasets/src/datasets/packaged_modules/h5folder/h5folder.py

📌 Component(s) Affected

  • area/datasets
  • area/load

📦 Release Note Classification

  • rn/feature – Adds support for loading .h5 datasets via load_dataset("h5folder", ...)

Let me know if any changes or improvements are needed — happy to iterate. Thanks for reviewing!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArjunJagdale
Contributor Author

ArjunJagdale commented Jun 26, 2025

I think the check failed because the import os, import h5py, and import datasets lines are not sorted alphabetically or not grouped properly.

Reordered import statements in h5folder.py (datasets, h5py, os) to follow alphabetical order as required by ruff (I001). This resolves the failed check_code_quality workflow in PR huggingface#7625.
@ArjunJagdale
Contributor Author

ArjunJagdale commented Jun 26, 2025

The commit [Merge branch 'main' into patch-4] was accidental. The commit
[chore: fix import order in h5folder.py to satisfy linter] should resolve the import order issue.

Development

Successfully merging this pull request may close these issues.

Loading Data from HDF files
2 participants