
Dataset caching overhaul and additional datasets #91

Merged
46 commits merged into choderalab:main from caching_and_more_data on May 2, 2024

Conversation

chrisiacovella
Member

Description

This PR will focus on two key things:

  • revamping the dataloader classes to be safer in terms of cached files (checking the MD5 checksum, better naming practices, etc.). This refers to issue File hashes #84.
  • adding dataloaders for other datasets (ANI2x, SPICE, etc.) so the curated datasets can be used in training.

Todos

Notable points that this PR has either accomplished or will accomplish.

  • [ ]

Status

  • Ready to go

@chrisiacovella added the enhancement (New feature or request) and WIP (Work in Progress) labels on Mar 29, 2024
@codecov-commenter

codecov-commenter commented Mar 29, 2024

Codecov Report

Attention: Patch coverage is 38.51675%, with 257 lines in your changes missing coverage. Please review.

Project coverage is 80.34%. Comparing base (2d6380d) to head (51f5148).
Report is 2 commits behind head on main.


@chrisiacovella
Member Author

At this point we still cannot train with this dataset; I need to resolve some issues related to default properties. Training currently fails with an error because Q is not defined in ANI2x (the logic needs to check whether "none" is defined...I've started changing the data structure for properties).

@chrisiacovella chrisiacovella mentioned this pull request Apr 11, 2024
@chrisiacovella
Member Author

Copying from issue #84:

The general sequence of loading data is

  • download the .hdf5.gz file
  • unzip the .hdf5.gz file
  • load the .hdf5 file
  • save the .hdf5 file as an .npz file
  • load the .npz file

Ideally, we want to use the .npz file if it exists and skip all the rest. If the .npz doesn't exist, we check whether the .hdf5 file exists. If not, we check whether the .hdf5.gz exists. If not, we download it. However, we can't just rely on whether a file exists; we need to make sure that it is the correct file.
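A minimal sketch of that fallback order, assuming hypothetical function and argument names (this is not the actual modelforge API, and the .json metadata check described further below is omitted for brevity):

```python
import gzip
import hashlib
import os
import shutil
import urllib.request

import numpy as np


def _md5_matches(path: str, expected_md5: str) -> bool:
    """True only if the file exists and its MD5 checksum equals the expected value."""
    if not os.path.isfile(path):
        return False
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest() == expected_md5


def load_cached_dataset(npz_file, hdf5_file, gz_file, url, gz_md5, hdf5_md5,
                        parse_hdf5, force_download=False):
    """Illustrative cache cascade: prefer the .npz, fall back to the .hdf5,
    then the .hdf5.gz, and only download when nothing on disk can be trusted."""
    if not force_download and os.path.isfile(npz_file):
        return np.load(npz_file)                         # processed cache already present
    if force_download or not _md5_matches(hdf5_file, hdf5_md5):
        if force_download or not _md5_matches(gz_file, gz_md5):
            urllib.request.urlretrieve(url, gz_file)     # (re)download the archive
        with gzip.open(gz_file, "rb") as src, open(hdf5_file, "wb") as dst:
            shutil.copyfileobj(src, dst)                 # unzip .hdf5.gz -> .hdf5
    data = parse_hdf5(hdf5_file)                         # caller-supplied hdf5 -> dict-of-arrays parser
    np.savez(npz_file, **data)                           # write the processed cache
    return np.load(npz_file)
```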

A few changes I've been making here:

  • Rather than using a fully generic filename for the various temporary files (i.e., something like "cached.npz" that doesn't tell us which dataset we have), each file has a unique name defined in the data loader (e.g., "qm9_dataset_processed.npz", "qm9_dataset.hdf5", "qm9_dataset.hdf5.gz").

  • While giving dataset-specific names helps, we also need to validate the checksums of the files. The checksums are encoded in the data loader along with the filenames. For example, when we call _download we check whether the .hdf5.gz file exists, and if it does, we compare its checksum against the expected checksum. If the checksum doesn't match, we know we need to download again. Note that we can also tell the code to force_download the file. Similarly, we record the known checksum for the .hdf5 file.

  • An issue comes up when dealing with the .npz files, as the checksum may differ across platforms. Furthermore, the datafile itself might change if a different set of "properties of interest" is selected. To handle this, when writing the .npz file we also write a .json file that contains some metadata: the checksums of the .hdf5 and .hdf5.gz files used to generate the data, the data_keys used (i.e., properties of interest), and the date the .npz file was generated. We check that the checksum of the .hdf5 file used to generate the .npz file matches the checksum in the data loader (to ensure that the code itself hasn't been updated), and we also check that the properties of interest match those defined in the data loader. If all of these match, we can be reasonably safe in our loading. Note that since the .json file is generated AFTER the .npz file is written, we will only be able to read it if the .npz file was written completely. A sketch of this metadata check is shown below.
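A rough sketch of what writing and validating this metadata sidecar could look like; the field names, file layout, and helper names below are assumptions for illustration, not the actual modelforge implementation:

```python
import datetime
import hashlib
import json
import os


def _md5(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


def write_npz_metadata(npz_file, hdf5_file, gz_file, data_keys):
    """Write the .json sidecar AFTER the .npz exists, so a partially written
    cache never looks valid. Field names are illustrative."""
    metadata = {
        "hdf5_checksum": _md5(hdf5_file),
        "hdf5_gz_checksum": _md5(gz_file),
        "data_keys": sorted(data_keys),        # properties of interest used for this cache
        "date_generated": datetime.date.today().isoformat(),
    }
    with open(npz_file + ".json", "w") as f:
        json.dump(metadata, f)


def npz_cache_is_valid(npz_file, expected_hdf5_md5, expected_data_keys):
    """Trust the .npz only if the sidecar exists and both the source-file checksum
    and the properties of interest match what the data loader expects."""
    sidecar = npz_file + ".json"
    if not (os.path.isfile(npz_file) and os.path.isfile(sidecar)):
        return False
    with open(sidecar) as f:
        metadata = json.load(f)
    return (metadata.get("hdf5_checksum") == expected_hdf5_md5
            and metadata.get("data_keys") == sorted(expected_data_keys))
```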

@chrisiacovella chrisiacovella requested a review from wiederm April 12, 2024 17:07
@chrisiacovella
Member Author

> Thanks for tackling this @chrisiacovella !
>
> How do you resolve the path where the cached files are stored? Are these in userspace or in a scratch/tmp directory?

Each Dataset can be initialized by setting the "local_cache_dir" argument, which specifies where the datafiles end up.
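For context, usage would look roughly like the sketch below; the import path and class name are assumptions, and only the local_cache_dir keyword comes from the comment above:

```python
# Hypothetical usage sketch: the import path and class name are assumptions;
# only the local_cache_dir keyword appears in the discussion above.
from modelforge.dataset import QM9Dataset

dataset = QM9Dataset(local_cache_dir="/scratch/username/modelforge_cache")
```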

@chrisiacovella
Member Author

@wiederm the equivariance tests seem to be failing for SAKE on macOS Python 3.10. I resolved these issues earlier by seeding the random number generator in equivalence_test_utils. It appears you commented this out (so the tests may not be identical each time). I think there might be an issue with using Euler angles to generate the rotation, rather than quaternions.
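For reference, a seeded, quaternion-based random rotation could look like the sketch below; this is illustrative only and not the actual code in equivalence_test_utils:

```python
# Illustrative sketch (not the actual modelforge test utility): seed the RNG so the
# equivariance test is reproducible, and build the random rotation from a unit
# quaternion instead of Euler angles.
import numpy as np
from scipy.spatial.transform import Rotation

rng = np.random.default_rng(2024)   # fixed seed -> the same rotation every run
quat = rng.normal(size=4)
quat /= np.linalg.norm(quat)        # normalized Gaussian 4-vector -> uniformly random rotation
rotation_matrix = Rotation.from_quat(quat).as_matrix()
```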

…s (e.g., limiting to max of 10 per record, for a total of 1000, for unit testing).
…sing to have my improved class that uses units.
…ts tested in test_models.py to qm9 and ani2x test sets.
@chrisiacovella
Member Author

Ok, tests are all set up now to handle looping over multiple datasets. I had to cut down a few of the datasets for test_models.py; I think we were running out of memory on some of the CI tests, since they were passing locally (the point is to test the NNP, not the dataset; I still keep qm9 and ani2x so we have some variety).

@chrisiacovella chrisiacovella merged commit 02f53c7 into choderalab:main May 2, 2024
5 checks passed
@chrisiacovella chrisiacovella deleted the caching_and_more_data branch July 30, 2024 21:19