Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

H5MD Can not read observables/<group>/<observable>/... #4598

Open
PythonFZ opened this issue May 17, 2024 · 11 comments
Open

H5MD Can not read observables/<group>/<observable>/... #4598

PythonFZ opened this issue May 17, 2024 · 11 comments
Labels
defect Format-H5MD hdf5-based H5MD trajectory format

Comments

@PythonFZ
Copy link
Contributor

Expected behavior

The H5MD standard defines two main ways to define observables. Either directly as group below the root observables or as a group specific observable observables/group.

observables
 \-- <observable1>
 |    \-- step: Integer[variable]
 |    \-- time: Float[variable]
 |    \-- value: <type>[variable]
 \-- <group1>
 |    \-- <observable2>
 |         \-- step: Integer[variable]
 |         \-- time: Float[variable]
 |         \-- value: <type>[variable][D][D]

Both should be supported.

Actual behavior

In my experience only the observables after root are supported but observables/group are not. It leads to KeyError: "Unable to synchronously open object (object 'value' doesn't exist)". This seems to be briefly discussed in #4320 but I wanted to give more information here.

Code to reproduce the behavior

I've attached a file licl_100.h5.zip because the cobrotoxin.h5md test file only contains observables/lambda but no group specific observables.

import MDAnalysis as mda

u = mda.Universe.empty(100, n_residues=100, atom_resindex=np.arange(100), trajectory=True)
reader = H5MDReader("licl_100.h5", convert_units=False)
u.trajectory = reader
....

Current version of MDAnalysis

  • Which version are you using? (run python -c "import MDAnalysis as mda; print(mda.__version__)") 2.7.0
  • Which version of Python (python -V)? 3.10.9
  • Which operating system? Ubuntu 22
@orbeckst orbeckst added the Format-H5MD hdf5-based H5MD trajectory format label May 30, 2024
@orbeckst
Copy link
Member

@edisj @ljwoods2 is this issue a known limitation in our H5MD implementation?

@edisj
Copy link
Contributor

edisj commented May 30, 2024

yes, this is related to point 1) in #4561

basically, /particles can contain arbitrary "atom groups" like /particles/group_1/, /particles/group_2/, etc... that contain the position, velocity, and force data for specific subsets of particles group_1, group_2. If these exist, the /observables group can contain corresponding /observables/group_1/*/value, /observables/group_2/*/value, where * can be whatever you calculate (e.g. total energy, temperature, ...) that corresponds to a calculation over specific atom groups group_1, group_2, ...

At the moment, H5MDReader assumes there is only 1 group in /particles, and all of the observables in /observables correspond to that single particle group

@edisj
Copy link
Contributor

edisj commented May 30, 2024

the same thing is happening in #4320

@orbeckst
Copy link
Member

@PythonFZ how would you expect MDAnalysis to support groups, i.e., what exactly should be happening?

@orbeckst
Copy link
Member

orbeckst commented Jun 1, 2024

I could reproduce with

import MDAnalysis as mda
import numpy as np
u = mda.Universe.empty(100, n_residues=100, atom_resindex=np.arange(100), trajectory=True)
u.load_new("licl_100.h5", format="H5MD", convert_units=False)

and failed with

KeyError: "Unable to synchronously open object (object 'value' doesn't exist)"

If this is an otherwise legal H5MD file then we should be able to at least read the particles information, even if we ignore observables. So this may count as a bug in our H5MD reader and I tentatively assign the defect label.

However, I tried reading licl_100.h5 with the reference implementation (read.py from pyh5md)

 python read.py licl_100.h5
Traceback (most recent call last):
  File "~/tmp/h5md/read.py", line 17, in <module>
    with File(args.file, 'r') as f:
         ^^^^^^^^^^^^^^^^^^^^
  File "~/anaconda3/envs/mda311/lib/python3.11/site-packages/pyh5md/h5md_module.py", line 236, in __init__
    raise KeyError("h5md group not found in file")
KeyError: 'h5md group not found in file'

and note that @PythonFZ 's H5MD file cannot be loaded by pyh5md, presumably because it is missing the mandatory h5md root level group (see https://www.nongnu.org/h5md/h5md.html#h5md-root-level ). I don't have time to dissect the h5 file further so could you please comment and demonstrate that the test file does indeed fully conform to the spec?

@edisj
Copy link
Contributor

edisj commented Jun 1, 2024

and note that @PythonFZ 's H5MD file cannot be loaded by pyh5md, presumably because it is missing the mandatory h5md root level group (see https://www.nongnu.org/h5md/h5md.html#h5md-root-level ).

that's interesting... should our implementation also fail immediately if the root h5md group is missing?

@orbeckst
Copy link
Member

orbeckst commented Jun 1, 2024

yes — it's mandatory

@PythonFZ
Copy link
Contributor Author

Thanks for looking into this @orbeckst.

This archive md.zip contains two updated files that fulfill the H5MD standard. The obs_root.h5 can be read by MDAanlysis, where the energy is stored in observables/energy. The obs_per_particles.h5 yields the error with the energy being in observables/atoms/energy.

Both files have been created using pyh5md.

how would you expect MDAnalysis to support groups, i.e., what exactly should be happening?

I think it would be beneficial to load the data into Timestep.data just like with the obs_root.h5 file.

@orbeckst
Copy link
Member

Observables in H5MD. Let's assume we have the below:

observables
 \-- <observable1>
 |    \-- step: Integer[variable]
 |    \-- time: Float[variable]
 |    \-- value: <type>[variable]
 \-- <observable2>
 |    \-- step: Integer[variable]
 |    \-- time: Float[variable]
 |    \-- value: <type>[variable][D]
 \-- <group1>
 |    \-- <observable3>
 |         \-- step: Integer[variable]
 |         \-- time: Float[variable]
 |         \-- value: <type>[variable][D][D]
 \-- <observable4>: <type>[]
 \-- ...

@PythonFZ, do you expect to always get ts.data['observable1'], ts.data['observable2'], ts.data['group1/observable3'], ts.data['observable4'] for each timestep step at which the observables have been recorded or would you like to specify which observables are pulled out?

What should happen when, say, group1/observable3 was only recorded every 10 steps? We have then timesteps at which there's no value defined for the observable.

as auxiliaries??

We have solved some of these questions with the auxiliary readers already; for instance, auxiliaries can "hold" a value for steps where it wasn't recorded — the details are in Auxiliaries and trajectories.

Perhaps it makes more sense to create an H5MD AuxReader (which may, under the hood, pull data out of the H5MDReader so that we don't open a trajectory twice) instead of trying to drop everything in ts.data.

The docs for auxiliaries are not very clear for users; at least the ones for the EDRReader and trajectories shows you what it looks like in practice and for h5md it might be

u = mda.Universe(topology, "trj.h5md")
aux = mda.auxiliary.H5MD.H5MDObservablesReader(u.trajectory)  # can pass reader instance!
u.trajectory.add_auxiliary({"obs1": "obervable1",
                            "group1_obs3": "group1/observable3"}, 
                            "group1/observable99": "group1/observable99"},
                           aux)
for ts in u.trajectory:
   print("observable1  = ", ts.aux.obs1)
   print("observable3 (group1) = ", ts.aux.group1_obs3)
   print("observable99 (group 1) = ", ts.aux["group1/observable99"])  # only dict

Would this be of interest?

It's also helpful if you write out code that you would have liked to use because that tells us how we should be structuring our user interface.

(Also, there are no promises that anyone is actually going to work on any of this. But if enough people think it would be useful then it becomes more attractive as a project for someone to pick it up.)

@PythonFZ
Copy link
Contributor Author

I think the flexible timesteps in general are difficult to handle. I'd suggest to do this in two steps:

  1. Allow MDAnalysis to read files with observables/particle/<name>/value. This can be achieved via Skip observables contained in particle groups #4615
  2. Consider the original issue of how observables should be handled. How is this handled for e.g. positions and velocities being dumped at different time intervals?

I'm happy to help with this. Currently, I don't have a use case for the observables - I only want to be able to read the file including cell, position, velocities. Do you think auxiliaries fit by looking at the definition of the two groups?

Particles: The size of the stored data scales linearly with the number of particles under consideration

Observables: The size of stored data is typically independent of the system size

@hmacdope
Copy link
Member

hmacdope commented Jun 18, 2024

Xref: #4615 (review) prototype for just dumping observables of the /group/observable form into ts.data @orbeckst @edisj. Happy to proceed with #4615 as is first while we work on the proper fix.

orbeckst added a commit that referenced this issue Jul 17, 2024
* skip observables contained in particle groups
* This is not a fix for #4598 but enables reading files that contain data in
   
            observables
             \-- <group1>
             |    \-- <observable2>
             |         \-- step: Integer[variable]
             |         \-- time: Float[variable]
             |         \-- value: <type>[variable][D][D]

* add test: open file
* add and test with `malformed.h5md`
* update CHANGELOG
* Update AUTHORS

---------

Co-authored-by: Hugo MacDermott-Opeskin <hugomacdermott@gmail.com>
Co-authored-by: Oliver Beckstein <orbeckst@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
defect Format-H5MD hdf5-based H5MD trajectory format
Projects
None yet
Development

No branches or pull requests

4 participants