Extending and updating curation sets. #74
The wiki has been updated with many examples and a discussion of the HDF5 file format and the underlying "data" data structure passed to the HDF5 file: https://github.com/choderalab/modelforge/wiki/Dataset-and-curation
…Made generic extraction function for tarred and compressed files.
…ue to changes in qcarchive.
…ets do not have the appropriate calculations and were removed). Changed the logic for sorting records and joining conformers due to inconsistencies in the naming scheme between datasets.
…theory available).
I still need to implement the test dataset. I had to rerun some calculations.
… hash, as some records do not provide this in the headers (or not in a consistent place). Also, some records do not have the length annotated; routines have been added so that an unknown length does not raise an error (the length is only used for the tqdm download bar, so it is not essential). Additional tests were added for these cases. Also changed the Zenodo and figshare helpers to compare the checksum even if the file already exists, and to download again if the checksums do not match; this helps us avoid using a partially downloaded file, or a file with the same name but the wrong content.
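As a rough illustration of the checks described above (not the helpers actually added in this PR), the sketch below shows one way to decide whether a cached file must be downloaded again and how to tolerate a missing length. The function names `file_md5`, `needs_download`, and `content_length` are hypothetical, and MD5 is an assumption; the commit message does not say which hash the records provide.

```python
import hashlib
from pathlib import Path
from typing import Mapping, Optional


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()  # assumption: the record's checksum is MD5
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(path: Path, expected_md5: str) -> bool:
    """Return True if the file is missing or its checksum does not match."""
    if not path.is_file():
        return True
    # An existing file is only trusted if its content matches the checksum,
    # which catches partial downloads and same-name files with other content.
    return file_md5(path) != expected_md5.lower()


def content_length(headers: Mapping[str, str]) -> Optional[int]:
    """Return the advertised size in bytes, or None if absent or malformed."""
    raw = headers.get("Content-Length")
    try:
        return int(raw) if raw is not None else None
    except ValueError:
        return None
```

The value returned by `content_length` could be passed as `total` to a tqdm progress bar; tqdm accepts `total=None` and then simply shows a running count without a percentage.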
There appears to be another change in the naming scheme in one of the datasets (Processing SPICE DES370K Single Points Dataset); I need to add a regex search to identify this different naming convention and skip all of the sorting by conformer IDs.
Great work! It's exciting to see how much progress has been made, and I can't wait to start training the models with these new datasets.
Description
This PR adds new datasets to modelforge, including the full SPICE 1.1.4, preliminary SPICE 2, ANI2x, and the test dataset.
Notes:
Todos
Notable points that this PR has either accomplished or will accomplish.
Status