Extending and updating curation sets. #74

chrisiacovella · 2024-03-06T08:13:54Z

Description

This will add new datasets to modelforge, including the full SPICE 1.1.4, preliminary SPICE 2, ANI2x, and the test dataset.

Notes:

Currently, the SPICE OpenFF code only pulls the PubChem data points from QCArchive; the updated code will include the full dataset. As part of this, the hdf5 file will get an additional, optional source attribute for the dataset, to allow for future filtering (as mentioned in Adding model loaders for SPICE, ANI1x and ANI2x #65). The file will also store forces rather than gradients, again as mentioned in Adding model loaders for SPICE, ANI1x and ANI2x #65.
As we have started to discuss in Difference between atomic self energies from regression and from literature #72, we do not necessarily need formation energies within the datasets. Either the linear regression scheme or a dictionary of values can be applied during dataset pre-processing.
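A minimal sketch of the dictionary-based option, to illustrate what "done in the dataset pre-processing" could look like. The function name, the `SELF_ENERGIES` table, and the numbers in it are hypothetical placeholders (arbitrary units, not literature values), and this is not the modelforge implementation:

```python
# Hypothetical per-element self energies keyed by atomic number.
# The values are illustrative placeholders, not literature or regressed values.
SELF_ENERGIES = {1: -0.5, 6: -37.8, 8: -75.0}


def subtract_self_energies(total_energy, atomic_numbers, self_energies=SELF_ENERGIES):
    """Return the total energy minus the sum of per-atom self energies."""
    offset = sum(self_energies[z] for z in atomic_numbers)
    return total_energy - offset


# Water: one oxygen and two hydrogens, with a made-up total energy.
water_energy = -76.4
residual = subtract_self_energies(water_energy, [8, 1, 1])
```

Keeping the raw total energies in the hdf5 file and applying the offset at load time would let the same file serve both the regression and the dictionary approach.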

Todos

Notable points that this PR has either accomplished or will accomplish.

ANI2x
ANI2x tests
SPICE 2
SPICE 2 tests
full SPICE 1.1.4
Add unit tests for new datasets
Update wiki with dataset/hdf5 format description.
Upon discussion, add a field that reports the total molecule charge (needed for SPICE, which allows non-neutral molecules).
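For the total-charge field, one possible approach is sketched below. It assumes per-atom formal charges are already available for each molecule. The PR does not say how the charge is computed, so the function name and input format here are hypothetical:

```python
def total_charge(formal_charges):
    """Sum per-atom formal charges to obtain the net molecular charge."""
    return sum(formal_charges)


# Acetate (CH3COO-), atoms ordered C, C, O, O, H, H, H; one oxygen carries -1.
acetate_formal_charges = [0, 0, 0, -1, 0, 0, 0]
acetate_charge = total_charge(acetate_formal_charges)
```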

Status

Ready to go

codecov-commenter · 2024-03-06T08:15:33Z

Codecov Report

Merging #74 (41f6653) into main (342c5ed) will increase coverage by 8.90%.
The diff coverage is 94.14%.


chrisiacovella · 2024-03-06T19:44:01Z

The wiki has been updated with many examples and a discussion of the hdf5 file format and the underlying "data" data structure passed to the hdf5 file.

https://github.com/choderalab/modelforge/wiki/Dataset-and-curation


chrisiacovella · 2024-03-12T05:12:17Z

I still need to implement the test data set. I had to rerun some calculations.

… hash, as some records do not provide this in the headers (or in the same consistent place). Also, some records do not have the length annotated; routines have been added to ensure we don't have an error if we don't know the length (this is only used for the tqdm download bar, so it is not essential). Additional tests added for these. Also changed zenodo and figshare helpers to compare the checksum even if the file exists (will download if they don't match; this will help us avoid using a partially downloaded file or a file with the same name, but wrong content).
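The "compare the checksum even if the file exists" behavior can be sketched as follows. This is an illustration using only the standard library; `md5_of_file` and `needs_download` are hypothetical names, not the helpers in modelforge/utils/remote.py:

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def needs_download(path, expected_md5):
    """Return True if the file is missing or its checksum does not match."""
    path = Path(path)
    if not path.is_file():
        return True
    return md5_of_file(path) != expected_md5
```

With this check, a partially downloaded file or a file with the right name but the wrong content fails the comparison and triggers a fresh download.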

chrisiacovella · 2024-03-16T14:51:20Z

There appears to be another change in the naming scheme in one of the datasets (Processing SPICE DES370K Single Points Dataset); I need to add some regex searching to identify this different naming convention and skip the sorting by conformer IDs for it.
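A rough sketch of the kind of regex check described above. It assumes, purely for illustration, that conformer records are named `<molecule>-<index>` and that records without a numeric suffix should skip conformer sorting. The actual naming conventions in the QCArchive datasets may differ, and all names below are hypothetical:

```python
import re
from collections import defaultdict

# Assumed convention: "<molecule>-<index>", e.g. "mol_a-3" (hypothetical).
CONFORMER_SUFFIX = re.compile(r"^(?P<base>.+)-(?P<index>\d+)$")


def conformer_index(name):
    """Return the numeric conformer index, or 0 if the name has no suffix."""
    match = CONFORMER_SUFFIX.match(name)
    return int(match.group("index")) if match else 0


def group_by_molecule(record_names):
    """Group record names by molecule and sort each group by conformer index.

    Names that do not follow the assumed convention form their own group,
    so no conformer sorting is applied to them.
    """
    groups = defaultdict(list)
    for name in record_names:
        match = CONFORMER_SUFFIX.match(name)
        key = match.group("base") if match else name
        groups[key].append(name)
    return {key: sorted(names, key=conformer_index) for key, names in groups.items()}


grouped = group_by_molecule(["mol_a-10", "mol_a-2", "mol_b", "mol_a-0"])
```

Sorting on the parsed integer avoids the lexicographic ordering problem where "mol_a-10" would sort before "mol_a-2".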


modelforge/curation/ani2x_curation.py

modelforge/curation/spice_114_curation.py

modelforge/tests/test_remote.py

modelforge/utils/remote.py

modelforge/tests/test_curation.py

wiederm

Great work! It's exciting to see how much progress has been made, and I can't wait to start training the models with these new datasets.

…ation

Added in ANI2x curation

ffd71ef

chrisiacovella added the enhancement (New feature or request) and WIP (Work in Progress) labels Mar 6, 2024

chrisiacovella added 7 commits March 6, 2024 15:04

Added tests for ani2x curation, including minimal dataset of dimers. …

4ad9c00

…Made generic extraction function for tarred and compressed files.

fixed cut and paste typo in test_curation

efe2df0

Modified SPICE openff curation; fixed spice tests that were failing d…

719b5a0

…ue to changes in qcarchive.

Updated spice openff curation to do the "full" dataset (note two subs…

25c1bd8

…ets do not have the appropriate calculations and were removed). changed logic in sorting of records and joining conformers due to inconsistencies in the naming scheme between datasets.

Updated docstrings in spice_openff_curation.py

70cc934

Preliminary implementation of SPICE 2 curation (note no openff level …

5dbc0b5

…theory available).

Moved key sorting to a separate function to allow it to be directly t…

5ee8b2c

…ested.

chrisiacovella requested a review from wiederm March 14, 2024 15:34

chrisiacovella added 2 commits March 14, 2024 11:19

Fixed variable renaming in ani2x for md5 hash sent to zenodo.

363dda5

chrisiacovella and others added 5 commits March 20, 2024 21:32

Fixed name issues in spice 2. all dataset curation sets run.

f54f4d1

Updated test for spice 2 renaming

53d1d33

Added in calculations and test to report total charge of a molecule t…

cf3c4a2

…o spice datasets

Merge branch 'main' into expand_curation

d5b8eda

Merge branch 'main' into expand_curation

dbc37f4