
Dataset caching overhaul and additional datasets #91

Merged
46 commits merged into choderalab:main from caching_and_more_data on May 2, 2024

Conversation

chrisiacovella
Member

Description

This PR will focus on two key things:

  • revamping the dataloader classes to be safer in terms of cached files (checking the MD5 checksum, better naming practices, etc.). This refers to issue File hashes #84.
  • adding dataloaders for other datasets (ANI2x, SPICE, etc.) so the curated datasets can be used in training.

Todos

Notable points that this PR has either accomplished or will accomplish.

  • [ ]

Status

  • Ready to go

@chrisiacovella added the enhancement (New feature or request) and WIP (Work in Progress) labels on Mar 29, 2024
@codecov-commenter

codecov-commenter commented Mar 29, 2024

Codecov Report

Attention: Patch coverage is 38.51675%, with 257 lines in your changes missing coverage. Please review.

Project coverage is 80.34%. Comparing base (2d6380d) to head (51f5148).
Report is 2 commits behind head on main.


@chrisiacovella
Member Author

At this point we still cannot train with this dataset; I need to resolve some issues related to default properties. Training currently fails with an error because Q is not defined in ANI2x (the logic needs to check whether "none" is defined...I've started changing the data structure for properties).

@chrisiacovella chrisiacovella mentioned this pull request Apr 11, 2024
@chrisiacovella
Member Author

Copying from issue #84:

The general sequence of loading data is

  • download the .hdf5.gz file
  • unzip the .hdf5.gz file
  • load the .hdf5 file
  • save the .hdf5 file as an .npz file
  • load the .npz file

Ideally, we want to use the .npz file if it exists and skip all the rest. If the .npz doesn't exist, we check whether the .hdf5 file exists. If not, we check whether the .hdf5.gz exists. If not, we download it. However, we can't just rely on whether a file exists; we need to make sure that it is the correct file.
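A minimal sketch of that fallback order, assuming hypothetical function and argument names (this is not the actual modelforge API, and the .json metadata check described further below is omitted for brevity):

```python
import gzip
import hashlib
import os
import shutil
import urllib.request

import numpy as np


def _md5_matches(path: str, expected_md5: str) -> bool:
    """True only if the file exists and its MD5 checksum equals the expected value."""
    if not os.path.isfile(path):
        return False
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest() == expected_md5


def load_cached_dataset(npz_file, hdf5_file, gz_file, url, gz_md5, hdf5_md5,
                        parse_hdf5, force_download=False):
    """Illustrative cache cascade: prefer the .npz, fall back to the .hdf5,
    then the .hdf5.gz, and only download when nothing on disk can be trusted."""
    if not force_download and os.path.isfile(npz_file):
        return np.load(npz_file)                         # processed cache already present
    if force_download or not _md5_matches(hdf5_file, hdf5_md5):
        if force_download or not _md5_matches(gz_file, gz_md5):
            urllib.request.urlretrieve(url, gz_file)     # (re)download the archive
        with gzip.open(gz_file, "rb") as src, open(hdf5_file, "wb") as dst:
            shutil.copyfileobj(src, dst)                 # unzip .hdf5.gz -> .hdf5
    data = parse_hdf5(hdf5_file)                         # caller-supplied hdf5 -> dict-of-arrays parser
    np.savez(npz_file, **data)                           # write the processed cache
    return np.load(npz_file)
```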

A few changes I've been making here:

  • Rather than using a fully generic filename for the various temporary files (i.e., something like "cached.npz" that doesn't tell us which dataset we have), each file has a unique name defined in the data loader (e.g., "qm9_dataset_processed.npz", "qm9_dataset.hdf5", "qm9_dataset.hdf5.gz").

  • While giving dataset-specific names helps, we also need to validate the checksums of the files. The checksums are encoded in the data loader along with the filenames. For example, when we call _download we check whether the .hdf5.gz file exists, and if it does, we compare its checksum against the expected checksum. If the checksum doesn't match, we know we need to download again. Note that we can also tell the code to force_download the file. Similarly, we record the known checksum for the .hdf5 file.

  • An issue comes up when dealing with the .npz files, as the checksum may differ across platforms. Furthermore, the datafile itself might change if a different set of "properties of interest" is selected. To handle this, when writing the .npz file we also write a .json file that contains some metadata: the checksums of the .hdf5 and .hdf5.gz files used to generate the data, the data_keys used (i.e., properties of interest), and the date the .npz file was generated. We check that the checksum of the .hdf5 file used to generate the .npz file matches the checksum in the data loader (to ensure that the code itself hasn't been updated), and we also check that the properties of interest match those defined in the data loader. If all of these match, we can be reasonably safe in our loading. Note that since the .json file is generated AFTER the .npz file is written, we will only be able to read it if the .npz file was written completely. A sketch of this metadata check is shown below.
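A rough sketch of what writing and validating this metadata sidecar could look like; the field names, file layout, and helper names below are assumptions for illustration, not the actual modelforge implementation:

```python
import datetime
import hashlib
import json
import os


def _md5(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


def write_npz_metadata(npz_file, hdf5_file, gz_file, data_keys):
    """Write the .json sidecar AFTER the .npz exists, so a partially written
    cache never looks valid. Field names are illustrative."""
    metadata = {
        "hdf5_checksum": _md5(hdf5_file),
        "hdf5_gz_checksum": _md5(gz_file),
        "data_keys": sorted(data_keys),        # properties of interest used for this cache
        "date_generated": datetime.date.today().isoformat(),
    }
    with open(npz_file + ".json", "w") as f:
        json.dump(metadata, f)


def npz_cache_is_valid(npz_file, expected_hdf5_md5, expected_data_keys):
    """Trust the .npz only if the sidecar exists and both the source-file checksum
    and the properties of interest match what the data loader expects."""
    sidecar = npz_file + ".json"
    if not (os.path.isfile(npz_file) and os.path.isfile(sidecar)):
        return False
    with open(sidecar) as f:
        metadata = json.load(f)
    return (metadata.get("hdf5_checksum") == expected_hdf5_md5
            and metadata.get("data_keys") == sorted(expected_data_keys))
```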

@chrisiacovella chrisiacovella requested a review from wiederm April 12, 2024 17:07
@chrisiacovella
Member Author

> Thanks for tackling this @chrisiacovella !
>
> How do you resolve the path where the cached files are stored? Are these in userspace or in a scratch/tmp directory?

Each Dataset can be initialized by setting the "local_cache_dir" argument, which specifies where the datafiles end up.
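For context, usage would look roughly like the sketch below; the import path and class name are assumptions, and only the local_cache_dir keyword comes from the comment above:

```python
# Hypothetical usage sketch: the import path and class name are assumptions;
# only the local_cache_dir keyword appears in the discussion above.
from modelforge.dataset import QM9Dataset

dataset = QM9Dataset(local_cache_dir="/scratch/username/modelforge_cache")
```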

@chrisiacovella
Member Author

@wiederm the equivariance tests seem to be failing for SAKE on macOS Python 3.10. I resolved these issues earlier by seeding the random number generator in equivalence_test_utils. It appears you commented this out (so the tests may not be identical each time). I think there might be an issue with using Euler angles to generate the rotation, rather than quaternions.
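For reference, a seeded, quaternion-based random rotation could look like the sketch below; this is illustrative only and not the actual code in equivalence_test_utils:

```python
# Illustrative sketch (not the actual modelforge test utility): seed the RNG so the
# equivariance test is reproducible, and build the random rotation from a unit
# quaternion instead of Euler angles.
import numpy as np
from scipy.spatial.transform import Rotation

rng = np.random.default_rng(2024)   # fixed seed -> the same rotation every run
quat = rng.normal(size=4)
quat /= np.linalg.norm(quat)        # normalized Gaussian 4-vector -> uniformly random rotation
rotation_matrix = Rotation.from_quat(quat).as_matrix()
```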

…s (e.g., limiting to max of 10 per record, for a total of 1000, for unit testing).
…sing to have my improved class that uses units.
…ts tested in test_models.py to qm9 and ani2x test sets.
@chrisiacovella
Member Author

Ok, tests are all set up now to handle looping over multiple datasets. I had to cut down a few of the datasets for test_models.py; I think we were running out of memory on some of the CI tests, since they were passing locally (the point is to test the NNP, not the dataset; I still keep qm9 and ani2x so we have some variety).

@chrisiacovella chrisiacovella merged commit 02f53c7 into choderalab:main May 2, 2024
5 checks passed
@chrisiacovella chrisiacovella deleted the caching_and_more_data branch July 30, 2024 21:19