Added data conversion scripts. #272

Draft
wants to merge 10 commits into
base: main

frobnitzem (Collaborator) commented Aug 2, 2024

These scripts outline the general format for working with all CSV file types.

I will add to this PR as I test these general scripts on multiple data sources.

@frobnitzem
Collaborator Author

yaml_to_config.py needs checking to ensure that it's selecting the right variables for compatibility with other datasets used to train the foundational model, and also to check the indexing it uses for graph output properties.

Specifically, it now contains:

      "input_node_feature_names": [
        "atomC",
        "atomF",
        "atomH",
        "atomN",
        "atomO",
        "atomS",
        "atomHg",
        "atomCl",
        "atomicnumber",
        "IsAromatic",
        "HSP",
        "HSP2",
        "HSP3",
        "Hprop"
      ],

But we likely can't use all of these. How should yaml_to_config be modified?
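One possible direction, sketched under assumptions: keep the full name list as metadata and select a subset by name, so that downstream code works with column indices. The helper `select_feature_indices` and the choice of `shared` features are hypothetical and only illustrate the idea; they are not part of yaml_to_config.py.

```python
# Feature names as listed in the generated config above.
input_node_feature_names = [
    "atomC", "atomF", "atomH", "atomN", "atomO", "atomS", "atomHg", "atomCl",
    "atomicnumber", "IsAromatic", "HSP", "HSP2", "HSP3", "Hprop",
]


def select_feature_indices(all_names, wanted):
    """Return the column index of each wanted feature, in the order requested."""
    index_of = {name: i for i, name in enumerate(all_names)}
    missing = [name for name in wanted if name not in index_of]
    if missing:
        raise KeyError(f"features not present in dataset: {missing}")
    return [index_of[name] for name in wanted]


# Hypothetical subset: features that other datasets might also provide.
shared = ["atomicnumber", "IsAromatic"]
indices = select_feature_indices(input_node_feature_names, shared)
print(indices)  # [8, 9]
```

Selecting by name rather than by hard-coded position means a dataset with a different column order would still resolve to the right columns, and a missing feature fails loudly.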

@frobnitzem
Collaborator Author

frobnitzem commented Aug 14, 2024

Several steps need to be done before this is ready to merge:

  • implement config.json reader according to data.md
    • check that these files are correctly generated by the present PR (utils/yaml_to_config.py)
    • update the output head configuration (see below)
    • test data loading from AdiosDataset with new feature selectors
  • add classification loss head type to HydraGNN
    • use sigmoid as the activation function at the last layer (only needed if the head itself must output probabilities; torch's logit-based losses such as BCEWithLogitsLoss take raw scores in (-inf, inf) and apply the sigmoid internally)
    • use binary cross entropy as the loss function for training, validation, and testing
  • update data loading (from AdiosDataset) so that metadata is used to target which features go into x and y (need to select only some columns when going from data file to in-memory format)
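To make the classification-head item concrete, here is a dependency-free sketch of the math: a sigmoid that maps a raw score to a probability, and binary cross entropy computed directly from the raw score. The function names are illustrative and this is not HydraGNN code; in torch, `torch.nn.BCEWithLogitsLoss` fuses the two steps in the same spirit.

```python
import math


def sigmoid(z):
    """Map a raw score in (-inf, inf) to a probability in (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # avoids overflow for large negative z
    return e / (1.0 + e)


def bce_from_logit(z, y):
    """Binary cross entropy for a label y in {0, 1}, given the raw score z.

    Same value as -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(z), written in
    the numerically stable form max(z, 0) - z*y + log(1 + exp(-|z|)).
    """
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


def mean_bce(logits, labels):
    """Average the per-sample loss over a batch."""
    return sum(bce_from_logit(z, y) for z, y in zip(logits, labels)) / len(logits)


print(mean_bce([2.0, -1.0, 0.0], [1, 0, 1]))
```

If the loss is computed from raw scores like this, the head does not need its own sigmoid during training; the sigmoid is then only applied when probabilities are reported.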

@frobnitzem
Collaborator Author

Documenting the new format

  1. List all relevant functions encountered during config loading and parsing.
  2. Check steps that process the config object against the previous and new formats we are using – what needs to be changed?
  3. Develop hand-written examples of the new config.json files we expect. Compare old and new formats side-by-side and explain the changes.
  4. Write a pydantic BaseModel and validate our new config.json files.
    • This can now be used to fix yaml_to_config.py and create config.json files for all our downstream tasks.
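As a rough illustration of step 4, the sketch below validates a config fragment using stdlib dataclasses as a dependency-free stand-in for the pydantic BaseModel proposed above. Only `input_node_feature_names` comes from the generated config; `output_heads`, `head_type`, `column`, and the allowed head types are hypothetical placeholders until the format in data.md is settled.

```python
from dataclasses import dataclass
from typing import List

HEAD_TYPES = {"regression", "classification"}  # assumed set of head types


@dataclass
class OutputHead:
    name: str
    head_type: str
    column: int  # hypothetical: index of the target column in the data file

    def __post_init__(self):
        if self.head_type not in HEAD_TYPES:
            raise ValueError(f"unknown head type {self.head_type!r} for {self.name!r}")
        if self.column < 0:
            raise ValueError(f"negative column index for {self.name!r}")


@dataclass
class FeatureConfig:
    input_node_feature_names: List[str]
    output_heads: List[OutputHead]

    def __post_init__(self):
        names = self.input_node_feature_names
        if len(set(names)) != len(names):
            raise ValueError("duplicate input node feature names")
        # Accept plain dicts (as loaded from JSON) and convert them.
        self.output_heads = [
            h if isinstance(h, OutputHead) else OutputHead(**h)
            for h in self.output_heads
        ]


# Hand-written example in the spirit of step 3 (all values hypothetical).
raw = {
    "input_node_feature_names": ["atomicnumber", "IsAromatic"],
    "output_heads": [{"name": "label", "head_type": "classification", "column": 0}],
}
cfg = FeatureConfig(**raw)
print(cfg.output_heads[0])
```

A pydantic model would express the same checks more compactly with field types and validators; the point here is only that a malformed config fails at load time rather than deep inside training.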

Implementing the new format

  1. Write down a list of all old variable names that existed in config.json before, but which are now going to be obsolete.
  2. Write and test the main functions inside HydraGNN that will load the new config format.
    • involves updating the data loading step to make use of feature indices now present
    • involves updating the loss computation step
  3. Search through HydraGNN's code and remove all references to these obsolete variables. This will require testing against existing HydraGNN examples to ensure that functionality is not lost or broken.
  4. Make a list of all examples that use the old config.json format. (to document future work that needs to be done without getting side-tracked doing it immediately)
  5. Incorporate our pydantic.BaseModel for loading config.json into HydraGNN.
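The steps about finding obsolete variables and listing examples that still use the old format could be supported by a small scanner like the one below. It is a sketch only: the key name in `OBSOLETE` and the layout of `old_config` are placeholders, since the actual list of obsolete names has not been written down yet.

```python
def find_obsolete_keys(config, obsolete, path=""):
    """Yield the dotted path of every key in a nested config that is obsolete."""
    if isinstance(config, dict):
        for key, value in config.items():
            here = f"{path}.{key}" if path else key
            if key in obsolete:
                yield here
            yield from find_obsolete_keys(value, obsolete, here)
    elif isinstance(config, list):
        for i, item in enumerate(config):
            yield from find_obsolete_keys(item, obsolete, f"{path}[{i}]")


# Placeholder names; replace with the real list once it exists.
OBSOLETE = {"placeholder_old_key"}
old_config = {
    "section_a": {"placeholder_old_key": [1, 3], "kept_key": 5},
    "section_b": [{"placeholder_old_key": 0}],
}
hits = list(find_obsolete_keys(old_config, OBSOLETE))
print(hits)  # ['section_a.placeholder_old_key', 'section_b[0].placeholder_old_key']
```

Running this over each example's config.json (loaded with `json.load`) would give the list of examples that still need migrating, without having to migrate them right away.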

After completing the above (new issue/PR)

  1. Run downstream task training.
  2. Run a foundation model task training (using a smaller dataset like qm7x).
    • This will require rebuilding the dataset to include our metadata about features.
  3. Run a hybrid training.
    • This will be easier using the new config and data formats, but will require re-configuring output heads and creating x,y,z values (maybe all zeros) for the input molecule.

* Added get_edge_attribute_name to smiles_utils

* Bugfix for returning 1-hot element names in smiles_utils/graph
  generation

* Made it possible to skip 1-hot element encoding in smiles_utils/graph
  generation

* created TODO list in yaml_to_config.py
Zach Fox and others added 2 commits September 23, 2024 15:44