python(feat): Add HDF5 upload service #261
Conversation
if non_string_config_dict["data"]:
    filtered_hdf5_configs.append(Hdf5Config(non_string_config_dict))

for data_cfg in hdf5_config._hdf5_config.data:
Is it possible to combine all of the string channels into a single separate config?
Unfortunately not. Since we have to assume unique timestamps between the different string channels, a single CSV file containing all the string channels can end up including null data points (`,,` in CSV), which get ingested as valid data points containing an empty `""` string. It's a downside of the CSV ingestion method, which doesn't differentiate between an empty string and a null value.
We could potentially group string channels that share the same timestamp "source", but I think that overcomplicates the code versus a future improvement where we ingest these values using a more flexible method than CSV.
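A standalone sketch (not code from this PR) of how merging string channels with unique timestamps produces those ambiguous empty cells:

```python
import csv
import io

# Two string channels with unique (non-overlapping) timestamps.
channel_a = {1: "foo", 3: "baz"}
channel_b = {2: "bar"}

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["time", "channel_a", "channel_b"])
for t in sorted(channel_a.keys() | channel_b.keys()):
    # A channel with no sample at time t produces an empty cell.
    writer.writerow([t, channel_a.get(t, ""), channel_b.get(t, "")])

print(buf.getvalue())
# time,channel_a,channel_b
# 1,foo,
# 2,,bar
# 3,baz,
#
# On ingestion, each empty cell is read back as a valid "" data point,
# indistinguishable from an intentional empty string.
```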
> It's a downside of the CSV ingestion method which doesn't differentiate between an empty string and a null value
Do you know if this is intentional? If not, can you open a ticket to see if we can fix this? Fine with this implementation in the meantime.
Ticket opened for future work
# First convert each csv file
csv_items: List[Tuple[str, CsvConfig]] = []
for config in split_configs:
    temp_file = stack.enter_context(NamedTemporaryFile(mode="wt", suffix=".csv"))
"t"
is the default so it's not needed here. We use "wt"
in some other places when writing to a compressed file because text mode is not the default with gzip.open
.
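A standalone illustration (not from this PR) of the mode-string difference:

```python
import gzip
from tempfile import NamedTemporaryFile

# In open()-style mode strings, text ("t") is implied, so mode="w"
# and mode="wt" are equivalent here.
with NamedTemporaryFile(mode="w", suffix=".csv") as temp_file:
    temp_file.write("time,value\n")  # accepts str: text mode

# gzip.open defaults to binary mode, so "t" must be explicit to
# write str instead of bytes. (Path is illustrative only.)
with gzip.open("/tmp/example.csv.gz", "wt") as gz:
    gz.write("time,value\n")
```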
import_services = []
for config in split_configs:
    with NamedTemporaryFile(mode="w", suffix=".csv") as temp_file:

# Ensures all temp files opened under stack.enter_context() will have __exit__ called as with a standard with statement
It's unclear to me why we need to use `ExitStack`. Is there a reason why this doesn't work with the standard `with` syntax?
csv_items: List[Tuple[str, CsvConfig]] = []
for config in split_configs:
    with NamedTemporaryFile(mode="wt", suffix=".csv") as temp_file:
        csv_config = _convert_to_csv_file(
            path,
            temp_file,
            config,
        )
        csv_items.append((temp_file.name, csv_config))

if hdf5_config._hdf5_config.run_name != "":
    ...

import_services = []
for filename, csv_config in csv_items:
    ...
return import_services
With `NamedTemporaryFile`, once you exit the `with` block, the temp file isn't just closed but also deleted, so you wouldn't be able to upload the CSV files with the approach above.
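A minimal standalone sketch of the difference (file contents are illustrative only):

```python
import os
from contextlib import ExitStack
from tempfile import NamedTemporaryFile

# Plain with: the temp file is deleted as soon as the block exits.
with NamedTemporaryFile(mode="w", suffix=".csv") as temp_file:
    temp_file.write("time,value\n")
    name = temp_file.name
assert not os.path.exists(name)  # already gone, nothing left to upload

# ExitStack: every file entered into the stack stays on disk until the
# stack itself exits, i.e. after all files have been written and uploaded.
with ExitStack() as stack:
    files = [
        stack.enter_context(NamedTemporaryFile(mode="w", suffix=".csv"))
        for _ in range(3)
    ]
    assert all(os.path.exists(f.name) for f in files)
# Files are cleaned up here, once the ExitStack closes.
```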
Ah ok. Might be clearer if the comment says that. Something like "Use ExitStack so that temporary files are not removed until all files have been processed and uploaded".
Adds an `Hdf5UploadService` to allow ingestion of HDF5 files. Also adds an `Hdf5Config` used to define the incoming data. Uses the following config format:
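(The config snippet itself was not captured in this excerpt; below is a hypothetical sketch: `run_name` and `data` appear elsewhere in this review, while every other key is assumed rather than confirmed.)

```python
# Hypothetical sketch only: "run_name" and "data" are referenced in this
# review, but the nested keys are assumed, not taken from the PR.
hdf5_config = Hdf5Config(
    {
        "run_name": "example-run",
        "data": [
            {
                "name": "channel_a",         # assumed key
                "time_dataset": "/time",     # assumed key
                "value_dataset": "/data/a",  # assumed key
            },
        ],
    }
)
```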
Verification
See included unit tests. Example includes ingestion of a sample HDF5 file.
HDFView of file

Uploaded to Sift
