Extending and updating curation sets. #74

Merged
chrisiacovella merged 18 commits into choderalab:main from expand_curation
Mar 29, 2024

Conversation

chrisiacovella

@chrisiacovella chrisiacovella commented Mar 6, 2024

Description

This PR adds new datasets to modelforge: the full SPICE 1.1.4, a preliminary SPICE 2, ANI2x, and the test dataset.

Notes:

Todos

Notable points that this PR has either accomplished or will accomplish.

  • ANI2x
  • ANI2x tests
  • SPICE 2
  • SPICE 2 tests
  • Full SPICE 1.1.4
  • Add unit tests for new datasets
  • Update the wiki with a description of the dataset/HDF5 format
  • Following discussion, add a field that reports the total molecular charge (needed for SPICE, which allows non-neutral molecules)

Status

  • Ready to go

@chrisiacovella chrisiacovella added enhancement New feature or request WIP Work in Progress labels Mar 6, 2024
@codecov-commenter

codecov-commenter commented Mar 6, 2024

Codecov Report

Merging #74 (41f6653) into main (342c5ed) will increase coverage by 8.90%.
The diff coverage is 94.14%.


@chrisiacovella

The wiki has been updated with many examples and a discussion of the HDF5 file format and the underlying "data" data structure that is written to the HDF5 file.

https://github.com/choderalab/modelforge/wiki/Dataset-and-curation

…Made generic extraction function for tarred and compressed files.
…ets do not have the appropriate calculations and were removed). Changed the logic for sorting records and joining conformers because of inconsistencies in the naming scheme between datasets.
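A generic extraction helper of the kind mentioned in the commit above could look roughly like the following. This is a minimal sketch and not the code merged in this PR: the function name `extract_archive` and its parameters are hypothetical, and it uses only Python's standard `tarfile` and `gzip` modules.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def extract_archive(archive_path: str, output_dir: str) -> list[str]:
    """Extract a tar archive (optionally compressed) or a single .gz file.

    Hypothetical helper for illustration; returns the names of the extracted files.
    """
    src = Path(archive_path)
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    if tarfile.is_tarfile(src):
        # Mode "r:*" lets tarfile detect gzip/bz2/xz compression transparently.
        with tarfile.open(src, mode="r:*") as tar:
            names = [m.name for m in tar.getmembers() if m.isfile()]
            tar.extractall(path=out)
        return names

    if src.suffix == ".gz":
        # Plain gzip-compressed file: strip the ".gz" suffix for the output name.
        target = out / src.stem
        with gzip.open(src, "rb") as f_in, open(target, "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)
        return [target.name]

    raise ValueError(f"Unsupported archive format: {src}")
```

Checking for a tar archive first means `.tar.gz`, `.tar.bz2`, and uncompressed `.tar` files all take the same code path, and only a bare `.gz` file needs special handling.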
@chrisiacovella

I still need to implement the test data set. I had to rerun some calculations.

@chrisiacovella chrisiacovella requested a review from wiederm March 14, 2024 15:34
… hash, as some records do not provide this in the headers (or do not provide it in a consistent place). Some records also do not report the content length; routines have been added so that an unknown length does not raise an error (the length is only used for the tqdm download bar, so it is not essential). Additional tests were added for these cases. The Zenodo and figshare helpers now also compare the checksum even if the file already exists and download again if it does not match; this avoids using a partially downloaded file or a file with the same name but the wrong content.
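The download logic described in this commit can be sketched as follows. This is an illustrative sketch under stated assumptions, not the modelforge implementation: `md5_of_file`, `needs_download`, and `parse_content_length` are hypothetical names, MD5 is assumed as the checksum algorithm, and `headers` stands in for whatever header mapping the HTTP client returns.

```python
import hashlib
import os
from typing import Mapping, Optional


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(path: str, expected_md5: str) -> bool:
    """Return True if the file is missing or its checksum does not match.

    Checking the checksum of an existing file guards against partial
    downloads and against files with the same name but different content.
    """
    if not os.path.isfile(path):
        return True
    return md5_of_file(path) != expected_md5


def parse_content_length(headers: Mapping[str, str]) -> Optional[int]:
    """Return the length in bytes, or None if the header is absent or malformed.

    Assumes a plain mapping keyed by "Content-Length" (case-sensitive here).
    """
    value = headers.get("Content-Length")
    try:
        return int(value) if value is not None else None
    except ValueError:
        return None
```

A `None` length can be passed directly as `total=None` to `tqdm`, which then shows a running counter without a percentage, so the progress bar degrades gracefully instead of raising.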
@chrisiacovella

There appears to be another change in the naming scheme in one of the datasets (Processing SPICE DES370K Single Points Dataset); I need to add some regex searching to identify this different naming convention and skip the sorting by conformer IDs for it.
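One way to approach the regex check described above is sketched here. The naming convention is an assumption made purely for illustration (record names of the form `<molecule_name>-<conformer_index>`); the actual patterns used in the SPICE datasets may differ, and `sort_records` is a hypothetical name.

```python
import re
from typing import List

# Hypothetical convention: "<molecule_name>-<conformer_index>", e.g. "abc-0".
CONFORMER_SUFFIX = re.compile(r"^(?P<name>.+)-(?P<index>\d+)$")


def sort_records(names: List[str]) -> List[str]:
    """Sort by (molecule name, conformer index) when every name matches the
    assumed pattern; otherwise return the names in their original order."""
    matches = [CONFORMER_SUFFIX.match(n) for n in names]
    if not all(matches):
        # At least one name follows a different convention: skip sorting.
        return list(names)
    keyed = [
        (m.group("name"), int(m.group("index")), n)
        for m, n in zip(matches, names)
    ]
    return [n for _, _, n in sorted(keyed)]
```

Converting the index to an integer keeps conformer 10 after conformer 2, which a plain string sort would not.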


@wiederm wiederm left a comment


Great work! It's exciting to see how much progress has been made, and I can't wait to start training the models with these new datasets.

@chrisiacovella chrisiacovella removed the WIP Work in Progress label Mar 28, 2024
@chrisiacovella chrisiacovella merged commit 6cf7b44 into choderalab:main Mar 29, 2024
6 checks passed
@chrisiacovella chrisiacovella deleted the expand_curation branch July 30, 2024 21:20

3 participants