
File hashes #84

Closed
chrisiacovella opened this issue Mar 22, 2024 · 2 comments · Fixed by #91
Labels
enhancement New feature or request

Comments

@chrisiacovella
Member

When we go to train, we will load a curated dataset, creating local files that we cache for future use. We need to store/compare the hashes of these files to ensure that we are not working with an incorrect or partially generated file. This should be trivial for the gzipped hdf5 files, since their checksums are known in advance; for the .npz files we generate locally, we will probably need to generate a metadata file after creation that stores the hash, since this hash is not known beforehand.
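Computing the hash itself is cheap; a minimal sketch (the function name and the choice of SHA-256 are assumptions, not the project's actual implementation):

```python
import hashlib

def calculate_file_checksum(file_path: str, chunk_size: int = 65536) -> str:
    """Hash a file in fixed-size chunks so large .hdf5/.npz files
    never need to fit in memory at once."""
    hasher = hashlib.sha256()
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```

Chunked reading keeps memory usage flat even for multi-GB files, and the resulting hex digest can be compared directly against a checksum stored in the data loader.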

@chrisiacovella
Member Author

The general sequence of the data loader is:

  • download the .hdf5.gz file
  • unzip the .hdf5.gz file
  • load the .hdf5 file
  • save the contents of the .hdf5 file as an .npz file
  • load the .npz file

Ideally, we want to use the .npz file if it exists and skip all the rest. If the .npz file doesn't exist, we check whether the .hdf5 file exists; if not, we check whether the .hdf5.gz file exists; if not, we download it. However, we can't just rely on a file existing: we need to make sure that it is the correct file.
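Translated into code, the fallback chain for the .hdf5 stage might look something like the sketch below (the function name, arguments, and use of urllib are illustrative assumptions, not the actual loader API; calculate_file_checksum is the helper sketched in the previous comment, and the .npz stage sits above this, gated by the metadata check described later in this comment):

```python
import gzip
import os
import shutil
import urllib.request

def ensure_hdf5(hdf5_path: str, gz_path: str, url: str,
                expected_gz_checksum: str, force_download: bool = False) -> str:
    """Work backwards through the cache stages: reuse the .hdf5 if present,
    otherwise reuse a checksum-valid .hdf5.gz, otherwise download a fresh copy."""
    if os.path.exists(hdf5_path) and not force_download:
        return hdf5_path  # a real loader would also checksum-validate this file
    # Re-download if forced, if the archive is missing, or if it fails its checksum.
    if (force_download or not os.path.exists(gz_path)
            or calculate_file_checksum(gz_path) != expected_gz_checksum):
        urllib.request.urlretrieve(url, gz_path)
    # Unzip the archive to recreate the .hdf5 file.
    with gzip.open(gz_path, "rb") as f_in, open(hdf5_path, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return hdf5_path
```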

A few changes I've been making in PR #91:

  • Rather than using a fully generic filename for the various temporary files (i.e., a name like "cached.npz" that doesn't tell us which dataset we have), each file has a unique name defined in the data loader (so, something like "qm9_dataset_processed.npz", "qm9_dataset.hdf5", "qm9_dataset.hdf5.gz").

  • While dataset-specific names help, we also need to validate the checksums of the files. The checksums are encoded in the data loader along with the filenames. For example, when we call _download we check whether the .hdf5.gz file exists, and if it does, we compare its checksum against the expected checksum. If the checksum doesn't match, we know we need to download again. Note that we can also tell the code to force_download the file. Similarly, we record the known checksum for the .hdf5 file.

  • An issue comes up when dealing with the .npz files, as the checksum may differ across platforms. Furthermore, the data file itself might change if a different set of "properties of interest" is selected. To handle this, when writing the .npz file we also write a .json file containing some metadata: the checksums of the .hdf5 and .hdf5.gz files used to generate the data, the data_keys used (i.e., the properties of interest), and the date the .npz file was generated. We check that the checksum of the .hdf5 file used to generate the .npz file matches the checksum in the data loader (to ensure that the code itself hasn't been updated), and that the properties of interest match those defined in the data loader. If all of these match, we can be reasonably safe in our loading. Note that since the .json file is generated AFTER the .npz file is written, it will only be readable if the .npz file was written completely. A sketch of this metadata scheme follows below.
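As a rough illustration of that sidecar approach (a minimal sketch; the function names, field names, and direct data_keys comparison are assumptions, not the actual PR #91 implementation):

```python
import json
from datetime import datetime

def write_npz_metadata(npz_path: str, hdf5_checksum: str,
                       gz_checksum: str, data_keys: list) -> None:
    """Write the .json sidecar AFTER the .npz is complete, so a crash
    mid-write leaves no sidecar and the cache is treated as invalid."""
    metadata = {
        "hdf5_checksum": hdf5_checksum,
        "hdf5_gz_checksum": gz_checksum,
        "data_keys": data_keys,
        "date_generated": datetime.now().isoformat(),
    }
    with open(npz_path.replace(".npz", ".json"), "w") as f:
        json.dump(metadata, f)

def npz_cache_is_valid(npz_path: str, expected_hdf5_checksum: str,
                       expected_data_keys: list) -> bool:
    """Trust the cached .npz only if the sidecar exists, the source .hdf5
    checksum matches the loader, and the properties of interest match."""
    try:
        with open(npz_path.replace(".npz", ".json")) as f:
            metadata = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return False
    return (metadata.get("hdf5_checksum") == expected_hdf5_checksum
            and metadata.get("data_keys") == expected_data_keys)
```

Keying the validation on the source .hdf5 checksum rather than on a hash of the .npz itself sidesteps the platform-dependent .npz checksums noted above.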

@wiederm wiederm linked a pull request May 2, 2024 that will close this issue
@wiederm
Member

wiederm commented May 2, 2024

This has been addressed in the linked PRs. Closing for now.

@wiederm wiederm closed this as completed May 2, 2024