
python(feat): Add HDF5 upload service #261


Open · wants to merge 11 commits into main

Conversation

nathan-sift (Contributor):

Adds an Hdf5UploadService to allow ingestion of HDF5 files. Also adds an Hdf5Config used to define the incoming data.

hdf5_upload_service = Hdf5UploadService(rest_config)
hdf5_upload_service.upload("sample_data.h5", hdf5_config)

Uses the following config format:

{
  "asset_name": string,
  "run_name": string,
  "run_id": string,
  "time": {
    "format": string,              // Required. See "Time Formats" below for options.
    "relative_start_time": string  // Required if "format" is a relative time format.
                                   // Must be in RFC3339 format.
                                   // Will be added to all relative times to
                                   // generate an absolute timestamp.
  },
  "data": [
    {
      "name": string,
      "description": string,
      "units": string,

      "time_dataset": string,    // Example: "/sensors/A/timestamps"
      "time_column": number,     // Optional: column to extract for 2D arrays. Default value = 1
      "value_dataset": string,
      "value_column": number,    // Optional: column to extract for 2D arrays. Default value = 1

      "enum_types": [            // Optional. Only valid if the data type is an enum.
        {
          "key": number,         // The raw enum value
          "name": string         // The display value for the enum value
        },
        ...
      ],
      "bit_field_elements": [    // Optional. Only valid if the data type is a bit field.
        {
          "index": number,       // Starting index of the bit field element
          "name": string,        // Name of the bit field element
          "bit_count": number    // Number of bits in the element
        },
        ...
      ]
    },
    ...
  ]
}
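
For illustration, a minimal end-to-end sketch of building a config in this format and uploading. All values here are invented for the example; the time format string is left elided since the options live in the "Time Formats" section. Passing a plain dict to Hdf5Config follows the Hdf5Config(non_string_config_dict) call visible in the diff:

hdf5_config = Hdf5Config(
    {
        "asset_name": "sample_asset",       # illustrative values throughout
        "run_name": "sample_run",
        "time": {
            "format": ...,                  # pick an option from "Time Formats"
            "relative_start_time": "2025-07-01T00:00:00Z",
        },
        "data": [
            {
                "name": "temperature",
                "units": "degC",
                "time_dataset": "/sensors/A/timestamps",
                "value_dataset": "/sensors/A/temperature",
                "value_column": 1,          # optional, only for 2D arrays
            }
        ],
    }
)

hdf5_upload_service = Hdf5UploadService(rest_config)
hdf5_upload_service.upload("sample_data.h5", hdf5_config)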

Verification

See the included unit tests.

The example includes ingestion of a sample HDF5 file.

[Screenshot: HDFView of the sample file]
[Screenshot: Data uploaded to Sift]

nathan-sift requested a review from marcsiftstack, July 1, 2025 21:29
if non_string_config_dict["data"]:
    filtered_hdf5_configs.append(Hdf5Config(non_string_config_dict))

for data_cfg in hdf5_config._hdf5_config.data:
marcsiftstack (Contributor):

Is it possible to combine all of the string channels into a single separate config?

nathan-sift (Contributor, Author):

Unfortunately not. Since we have to assume unique timestamps across the different string channels, a single CSV file containing all of them can end up including null data points (",," in the CSV), which get ingested as valid data points containing an empty ("") string. It's a downside of the CSV ingestion method, which doesn't differentiate between an empty string and a null value.

We could potentially group string channels that share the same timestamp "source", but I think that overcomplicates the code; the better path is a future improvement where we ingest these values using a more flexible method than CSV.
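
To make the failure mode concrete, here is a small self-contained illustration (data invented for the example): two string channels with disjoint timestamps merged into one CSV leave empty cells, and the csv module reads an absent value and a genuine empty string back identically:

import csv
import io

# Two string channels with disjoint timestamps merged into one CSV:
# each row has a value for only one channel, leaving the other cell empty.
merged = io.StringIO("time,channel_a,channel_b\n1,foo,\n2,,bar\n")
for row in csv.DictReader(merged):
    # A null (absent) value and a genuine "" both come back as "" --
    # the CSV layer cannot tell them apart.
    print(row)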

marcsiftstack (Contributor), Jul 16, 2025:

> It's a downside of the CSV ingestion method which doesn't differentiate between an empty string and a null value

Do you know if this is intentional? If not, can you open a ticket to see if we can fix this? Fine with this implementation in the meantime.

nathan-sift (Contributor, Author):

Ticket opened for future work.

# First convert each csv file
csv_items: List[Tuple[str, CsvConfig]] = []
for config in split_configs:
    temp_file = stack.enter_context(NamedTemporaryFile(mode="wt", suffix=".csv"))
marcsiftstack (Contributor):

"t" is the default so it's not needed here. We use "wt" in some other places when writing to a compressed file because text mode is not the default with gzip.open.

nathan-sift (Contributor, Author):

import_services = []
for config in split_configs:
    with NamedTemporaryFile(mode="w", suffix=".csv") as temp_file:
# Ensures all temp files opened under stack.enter_context() will have __exit__ called as with a standard with statement
marcsiftstack (Contributor):

It's unclear to me why we need to use ExitStack. Is there a reason why this doesn't work with the standard with syntax?

csv_items: List[Tuple[str, CsvConfig]] = []
for config in split_configs:
    with NamedTemporaryFile(mode="wt", suffix=".csv") as temp_file:
        csv_config = _convert_to_csv_file(
            path,
            temp_file,
            config,
        )
        csv_items.append((temp_file.name, csv_config))

if hdf5_config._hdf5_config.run_name != "":
    ...

import_services = []
for filename, csv_config in csv_items:
    ...

return import_services

nathan-sift (Contributor, Author):

With NamedTemporaryFile, once you exit the with block, the temp file isn't just closed, but also deleted, so you wouldn't be able to upload the CSV files with the approach above.
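
A condensed sketch of the pattern under discussion: with ExitStack, every temp file created in the loop stays open (and on disk) until the stack exits, after all conversions and uploads are done. CsvConfig, _convert_to_csv_file, path, and split_configs are names from the diff; the upload step itself is elided:

from contextlib import ExitStack
from tempfile import NamedTemporaryFile
from typing import List, Tuple

with ExitStack() as stack:
    csv_items: List[Tuple[str, CsvConfig]] = []
    for config in split_configs:
        # enter_context defers cleanup: the file is deleted only when the
        # ExitStack exits, not at the end of this loop iteration.
        temp_file = stack.enter_context(NamedTemporaryFile(mode="w", suffix=".csv"))
        csv_config = _convert_to_csv_file(path, temp_file, config)
        csv_items.append((temp_file.name, csv_config))

    import_services = []
    for filename, csv_config in csv_items:
        ...  # upload each CSV while its temp file still exists

# Exiting the with block closes and deletes all temp files here.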

marcsiftstack (Contributor):

Ah ok. Might be more clear if the comment says that. Something like "Use ExitStack so that temporary files are not removed until all files have been processed and uploaded".


marcsiftstack self-requested a review, July 16, 2025 02:55
