Added data conversion scripts. #272

frobnitzem · 2024-08-02T00:07:09Z

These scripts outline the general format for working with all CSV file types.

I will add to this PR as I test these general scripts on multiple data sources.

frobnitzem · 2024-08-05T21:46:44Z

yaml_to_config.py needs checking to ensure that it's selecting the right variables for compatibility with other datasets used to train the foundational model, and also to check the indexing it uses for graph output properties.

Specifically, it now contains:

      "input_node_feature_names": [
        "atomC",
        "atomF",
        "atomH",
        "atomN",
        "atomO",
        "atomS",
        "atomHg",
        "atomCl",
        "atomicnumber",
        "IsAromatic",
        "HSP",
        "HSP2",
        "HSP3",
        "Hprop"
      ],

But we likely can't use all of these. How should yaml_to_config.py be modified?
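One possible direction, sketched here under assumptions: keep the full name list as metadata and derive column indices for whichever subset turns out to be compatible. The helper names `select_feature_indices` and `select_columns` are hypothetical, and the subset shown is only an illustration, not a recommendation.

```python
# Feature names exactly as listed in the generated config above.
ALL_NODE_FEATURES = [
    "atomC", "atomF", "atomH", "atomN", "atomO", "atomS", "atomHg", "atomCl",
    "atomicnumber", "IsAromatic", "HSP", "HSP2", "HSP3", "Hprop",
]


def select_feature_indices(all_names, wanted):
    """Return the column index of each wanted feature, in the order requested."""
    position = {name: i for i, name in enumerate(all_names)}
    missing = [name for name in wanted if name not in position]
    if missing:
        raise KeyError(f"unknown feature names: {missing}")
    return [position[name] for name in wanted]


def select_columns(rows, indices):
    """Keep only the chosen columns of a row-major feature table."""
    return [[row[i] for i in indices] for row in rows]


# Hypothetical subset; which features are actually usable is the open question.
wanted = ["atomicnumber", "IsAromatic"]
indices = select_feature_indices(ALL_NODE_FEATURES, wanted)
```

Looking names up rather than hard-coding positions means a typo or a renamed feature fails loudly instead of silently selecting the wrong column.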

frobnitzem · 2024-08-14T19:23:21Z

Several steps need to be done before this is ready to merge:

- implement config.json reader according to data.md
  - check that these files are correctly generated by the present PR (utils/yaml_to_config.py)
  - update the output head configuration (see below)
  - test data loading from AdiosDataset with new feature selectors
- add classification loss head type to HydraGNN
  - use sigmoid as the activation function at the last layer (only needed if we end with an activation, since torch's logit-based cross-entropy losses accept raw values in (-inf, inf) and apply the normalization internally)
  - use binary cross entropy as the loss function for training, validation, and testing
- update data loading (from AdiosDataset) so that metadata is used to target which features go into x and y (need to select only some columns when going from data file to in-memory format)
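To make the sigmoid remark concrete, here is a small pure-Python sketch (not HydraGNN code) showing that binary cross entropy computed from a sigmoid output equals the loss computed directly from the raw logit. In PyTorch terms, `torch.nn.BCELoss` expects probabilities while `torch.nn.BCEWithLogitsLoss` takes raw outputs and applies the sigmoid itself; which of the two the new head will use is not decided in this thread.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function mapping (-inf, inf) to (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bce_from_probability(p, y):
    """Binary cross entropy for a probability p in (0, 1) and a label y in {0, 1}."""
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def bce_from_logit(z, y):
    """Same loss computed from the raw output z, with no explicit sigmoid layer.

    Uses the stable form max(z, 0) - z*y + log(1 + exp(-|z|)).
    """
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
```

The practical consequence is that a final sigmoid layer and a logit-based loss should not be combined, since that would apply the sigmoid twice.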

frobnitzem · 2024-09-20T20:33:41Z

Documenting the new format

List all relevant functions encountered during config loading and parsing.
Check steps that process the config object against the previous and new formats we are using – what needs to be changed?
Develop hand-written examples of the new config.json files we expect. Compare old and new formats side-by-side and explain the changes.
Write a pydantic BaseModel and validate our new config.json files.
• This can now be used to fix yaml_to_config.py and create config.json files for all our downstream tasks.
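A rough idea of what such validation might check, written with a standard-library dataclass as a stand-in for the pydantic model so that it runs without extra dependencies. Only `input_node_feature_names` comes from this thread; the `graph_output_names` field and the `load_features` helper are hypothetical, and the real schema is whatever data.md specifies.

```python
import json
from dataclasses import dataclass
from typing import List

ALLOWED_KEYS = {"input_node_feature_names", "graph_output_names"}


@dataclass(frozen=True)
class DatasetFeatures:
    input_node_feature_names: List[str]
    graph_output_names: List[str]  # hypothetical field, for illustration only

    def __post_init__(self):
        # Dataclasses do not enforce annotations, so check the values by hand.
        for field_name in ("input_node_feature_names", "graph_output_names"):
            value = getattr(self, field_name)
            if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
                raise TypeError(f"{field_name} must be a list of strings")
            if len(set(value)) != len(value):
                raise ValueError(f"{field_name} contains duplicate names")


def load_features(text):
    """Parse a JSON string and reject unknown keys before building the model."""
    raw = json.loads(text)
    unknown = set(raw) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unknown keys: {sorted(unknown)}")
    # A missing key surfaces as a TypeError from the dataclass constructor.
    return DatasetFeatures(**raw)
```

A pydantic `BaseModel` would give the same type checks declaratively and with better error messages; the sketch only shows the kind of rules worth encoding.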

Implementing the new format

Write down a list of all old variable names that existed in config.json before, but which are now going to be obsolete.
Write and test the main functions inside HydraGNN that will load the new config format.
- involves updating the data loading step to make use of feature indices now present
- involves updating the loss computation step

Search through HydraGNN's code and remove all references to these obsolete variables. This will require testing against existing HydraGNN examples to ensure that functionality is not lost or broken.
Make a list of all examples that use the old config.json format. (to document future work that needs to be done without getting side-tracked doing it immediately)
Incorporate our pydantic.BaseModel for loading config.json into HydraGNN.
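The search step could be semi-automated along these lines. This is a minimal sketch: the names in `OBSOLETE_NAMES` are placeholders (the real list is still to be written), `find_obsolete_references` is a hypothetical helper, and it only catches names that appear as quoted string keys in Python files.

```python
import re
from pathlib import Path

# Placeholder names only; replace with the real list of obsolete config keys.
OBSOLETE_NAMES = ["old_key_a", "old_key_b"]


def find_obsolete_references(root, names):
    """Yield (path, line_number, name) for each quoted use of an obsolete name."""
    if not names:
        return
    alternatives = "|".join(re.escape(name) for name in names)
    pattern = re.compile("[\"'](%s)[\"']" % alternatives)
    for path in sorted(Path(root).rglob("*.py")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for line_number, line in enumerate(lines, start=1):
            for match in pattern.finditer(line):
                yield path, line_number, match.group(1)
```

Running it over the examples directory as well would produce the list of examples still using the old format as a by-product.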

After completing the above (new issue/PR)

Run downstream task training.
Run a foundation model task training (using a smaller dataset like qm7x).
- This will require rebuilding the dataset to include our metadata about features.
Run a hybrid training.
- This will be easier using the new config and data formats, but will require re-configuring output heads and creating x, y, z values (maybe all zeros) for the input molecule.

kshitij-v-mehta mentioned this pull request Aug 14, 2024

Clintox dataset example #271

Closed

frobnitzem added 8 commits September 23, 2024 12:58

Added data conversion scripts.

5628880

Successful import_csv and yaml_to_config.

0208c16

Added documentation and dataset model validator.

a4f57ab

Updates for dataset ingestion.

cdb2d85

* Added get_edge_attribute_name to smiles_utils
* Bugfix for returning 1-hot element names in smiles_utils/graph generation
* Made it possible to skip 1-hot element encoding in smiles_utils/graph generation
* Created TODO list in yaml_to_config.py

Updated import_csv with additional input validation and pq read ability.

3bd9cd6

Updated train.py

e3ad170

Added config.py parser.

0434886

Moved Training out of NeuralNetwork in config.

7c3ecf3

frobnitzem force-pushed the conversion branch from 8a8fd50 to 7c3ecf3 September 23, 2024 17:00

Zach Fox and others added 2 commits September 23, 2024 15:44

added positions

6644e25

Merge pull request #286 from zachfox/fox/add_positions

9097721

added positions for csv-to-adios pipeline
