
python(feat): Add HDF5 upload service #261


Open · wants to merge 11 commits into main

Conversation

nathan-sift (Contributor):

Adds an Hdf5UploadService to allow ingestion of HDF5 files. Also adds an Hdf5Config used to define the incoming data.

hdf5_upload_service = Hdf5UploadService(rest_config)
hdf5_upload_service.upload("sample_data.h5", hdf5_config)

Uses the following config format:

{
  "asset_name": string,
  "run_name": string,
  "run_id": string,
  "time": {
    "format": string,              // Required. See "Time Formats" below for options.
    "relative_start_time": string  // Required if "format" is a relative time format.
                                   // Must be in RFC3339 format.
                                   // Will be added to all relative times to
                                   // generate an absolute timestamp.
  },
  "data": [
    {
      "name": string,
      "description": string,
      "units": string,

      "time_dataset": string,    // Example: "/sensors/A/timestamps"
      "time_column": number,     // Optional: column to extract for 2D arrays. Default value = 1
      "value_dataset": string,
      "value_column": number,    // Optional: column to extract for 2D arrays. Default value = 1

      "enum_types": [            // Optional. Only valid if the data type is an enum.
        {
          "key": number,         // The raw enum value
          "name": string         // The display value for the enum value
        },
        ...
      ],
      "bit_field_elements": [    // Optional. Only valid if the data type is a bit field.
        {
          "index": number,       // Starting index of the bit field element
          "name": string,        // Name of the bit field element
          "bit_count": number    // Number of bits in the element
        },
        ...
      ]
    },
    ...
  ]
}
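
For illustration, a minimal end-to-end sketch of building a config in this format and uploading. All values here are invented for the example; the time format string is left elided since the options live in the "Time Formats" section. Passing a plain dict to Hdf5Config follows the Hdf5Config(non_string_config_dict) call visible in the diff:

hdf5_config = Hdf5Config(
    {
        "asset_name": "sample_asset",       # illustrative values throughout
        "run_name": "sample_run",
        "time": {
            "format": ...,                  # pick an option from "Time Formats"
            "relative_start_time": "2025-07-01T00:00:00Z",
        },
        "data": [
            {
                "name": "temperature",
                "units": "degC",
                "time_dataset": "/sensors/A/timestamps",
                "value_dataset": "/sensors/A/temperature",
                "value_column": 1,          # optional, only for 2D arrays
            }
        ],
    }
)

hdf5_upload_service = Hdf5UploadService(rest_config)
hdf5_upload_service.upload("sample_data.h5", hdf5_config)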

Verification

See the included unit tests.

The example includes ingestion of a sample HDF5 file.

[Screenshot: HDFView of the sample file]
[Screenshot: Data uploaded to Sift]

nathan-sift requested a review from marcsiftstack, July 1, 2025 21:29
if non_string_config_dict["data"]:
    filtered_hdf5_configs.append(Hdf5Config(non_string_config_dict))

for data_cfg in hdf5_config._hdf5_config.data:
marcsiftstack (Contributor):

Is it possible to combine all of the string channels into a single separate config?

nathan-sift (Contributor, Author):

Unfortunately not. Since we have to assume unique timestamps across the different string channels, a single CSV file containing all of them can end up including null data points (",," in the CSV), which get ingested as valid data points containing an empty ("") string. It's a downside of the CSV ingestion method, which doesn't differentiate between an empty string and a null value.

We could potentially group string channels that share the same timestamp "source", but I think that overcomplicates the code; the better path is a future improvement where we ingest these values using a more flexible method than CSV.
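
To make the failure mode concrete, here is a small self-contained illustration (data invented for the example): two string channels with disjoint timestamps merged into one CSV leave empty cells, and the csv module reads an absent value and a genuine empty string back identically:

import csv
import io

# Two string channels with disjoint timestamps merged into one CSV:
# each row has a value for only one channel, leaving the other cell empty.
merged = io.StringIO("time,channel_a,channel_b\n1,foo,\n2,,bar\n")
for row in csv.DictReader(merged):
    # A null (absent) value and a genuine "" both come back as "" --
    # the CSV layer cannot tell them apart.
    print(row)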

marcsiftstack (Contributor), Jul 16, 2025:

> It's a downside of the CSV ingestion method which doesn't differentiate between an empty string and a null value

Do you know if this is intentional? If not, can you open a ticket to see if we can fix this? Fine with this implementation in the meantime.

nathan-sift (Contributor, Author):

Ticket opened for future work.

# First convert each csv file
csv_items: List[Tuple[str, CsvConfig]] = []
for config in split_configs:
    temp_file = stack.enter_context(NamedTemporaryFile(mode="wt", suffix=".csv"))
marcsiftstack (Contributor):

"t" is the default so it's not needed here. We use "wt" in some other places when writing to a compressed file because text mode is not the default with gzip.open.

nathan-sift (Contributor, Author):

import_services = []
for config in split_configs:
    with NamedTemporaryFile(mode="w", suffix=".csv") as temp_file:
# Ensures all temp files opened under stack.enter_context() will have __exit__ called as with a standard with statement
marcsiftstack (Contributor):

It's unclear to me why we need to use ExitStack. Is there a reason why this doesn't work with the standard with syntax?

csv_items: List[Tuple[str, CsvConfig]] = []
for config in split_configs:
    with NamedTemporaryFile(mode="wt", suffix=".csv") as temp_file:
        csv_config = _convert_to_csv_file(
            path,
            temp_file,
            config,
        )
        csv_items.append((temp_file.name, csv_config))

if hdf5_config._hdf5_config.run_name != "":
    ...

import_services = []
for filename, csv_config in csv_items:
    ...

return import_services

nathan-sift (Contributor, Author):

With NamedTemporaryFile, once you exit the with block, the temp file isn't just closed, but also deleted, so you wouldn't be able to upload the CSV files with the approach above.
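
A condensed sketch of the pattern under discussion: with ExitStack, every temp file created in the loop stays open (and on disk) until the stack exits, after all conversions and uploads are done. CsvConfig, _convert_to_csv_file, path, and split_configs are names from the diff; the upload step itself is elided:

from contextlib import ExitStack
from tempfile import NamedTemporaryFile
from typing import List, Tuple

with ExitStack() as stack:
    csv_items: List[Tuple[str, CsvConfig]] = []
    for config in split_configs:
        # enter_context defers cleanup: the file is deleted only when the
        # ExitStack exits, not at the end of this loop iteration.
        temp_file = stack.enter_context(NamedTemporaryFile(mode="w", suffix=".csv"))
        csv_config = _convert_to_csv_file(path, temp_file, config)
        csv_items.append((temp_file.name, csv_config))

    import_services = []
    for filename, csv_config in csv_items:
        ...  # upload each CSV while its temp file still exists

# Exiting the with block closes and deletes all temp files here.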

marcsiftstack (Contributor):

Ah ok. Might be more clear if the comment says that. Something like "Use ExitStack so that temporary files are not removed until all files have been processed and uploaded".


marcsiftstack self-requested a review, July 16, 2025 02:55
