Difference between atomic self energies from regression and from literature #72
I'm (mostly) copying my comment from the PR: I wouldn't say these are really all that far off. All but hydrogen have less than a 1% difference (note that I'm regressing against the entire QM9 dataset). Hydrogen actually has the smallest numerical difference (i.e., delta), but because its self energy is nearly 2 orders of magnitude smaller than the other elements', the same delta just seems more significant. All values are in kJ/mol.
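For context, this is the kind of least-squares fit being discussed; the array names and file paths below are placeholders for illustration, not the actual repository code:

```python
import numpy as np

# Hypothetical inputs (placeholder paths, not the repository's actual data files):
#   counts[i, j]  : number of atoms of element j in molecule i
#   energies[i]   : total energy of molecule i, in kJ/mol
counts = np.load("qm9_element_counts.npy")
energies = np.load("qm9_total_energies.npy")

# Ordinary least squares: energies ≈ counts @ self_energies
self_energies, residuals, rank, _ = np.linalg.lstsq(counts, energies, rcond=None)
print(dict(zip(["H", "C", "N", "O", "F"], self_energies)))
```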
But since the regressed values are all consistently higher than the DFT values, if we use them we are going to end up with positive values for small molecules and much smaller energies for large molecules. That is, the differences are in fact large relative to the scale of the formation energy (which, for a random sampling of molecules from QM9, seems to be closer to the scale of the hydrogen self energy). But that might just be as accurate as one can get, given that the input energies are on the order of 10^6 kJ/mol. A few quick calculations of the formation energy (for methane, ammonia, water, acetylene, and CCC(=O)C=O) demonstrate the differences; see the sketch below. Two questions:
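A minimal sketch of the formation-energy-style check referred to above; the self-energy and total-energy numbers here are made-up placeholders purely to show the arithmetic, not QM9 or DFT values:

```python
def offset_energy(total_energy, atom_counts, self_energies):
    """Subtract the summed atomic self energies (all in kJ/mol).
    With totals on the order of 1e5-1e6 kJ/mol and results of only a few
    thousand kJ/mol, a ~1% shift in the self energies can flip the sign."""
    return total_energy - sum(n * self_energies[el] for el, n in atom_counts.items())

# Placeholder numbers for methane (CH4), for illustration only:
dft_self = {"H": -1.30e3, "C": -9.95e4}
regressed_self = {"H": -1.29e3, "C": -9.90e4}   # slightly higher, as in the table above
methane_total = -1.06e5
print(offset_energy(methane_total, {"H": 4, "C": 1}, dft_self))
print(offset_energy(methane_total, {"H": 4, "C": 1}, regressed_self))
```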
Following up on our discussion today: we should test this! Does it have any impact on the accuracy of the trained model? On the training time? On the stability? We could also take the DFT values and make uniform perturbations: increase/decrease by 5%, 10%, 20%, etc. What if all these values were off by an order of magnitude (changing the order of magnitude of the "computed" energy)? If model training is fast, this would be an easy set of tests to determine how sensitive training is to the values used and whether this is something to be very concerned about. Ultimately, using DFT-calculated values is probably the most rigorous/straightforward approach and should not in any way be an expensive calculation, but if these values are not available, this could be a substantial roadblock. I think if we can determine the degree to which these values influence the results, we could make some very clear suggestions as to what to do (e.g., if this does not make much of a difference, one can likely use any reasonable literature values, even if calculated with a different engine or even a different level of theory).
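A sketch of the perturbation test proposed above; the dictionary of self energies contains placeholder values and `train_model` stands in for whatever training entry point the package actually exposes:

```python
def perturb_self_energies(self_energies, fraction):
    """Scale every atomic self energy by (1 + fraction), e.g. fraction=0.05 for +5%."""
    return {el: e * (1.0 + fraction) for el, e in self_energies.items()}

dft_self_energies = {"H": -1.3e3, "C": -9.95e4, "N": -1.43e5, "O": -1.97e5}  # placeholders
for frac in (-0.20, -0.10, -0.05, 0.05, 0.10, 0.20):
    perturbed = perturb_self_energies(dft_self_energies, frac)
    # metrics = train_model(self_energies=perturbed)  # hypothetical training call
    print(f"{frac:+.0%}: {perturbed}")
```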
Additional comment regarding linear regression. @wiederm noted that for a larger dataset, like ANI2x, we will likely run into issues with being able to fully load it into memory, or with having too many datapoints to fit efficiently. Based on the preliminary fitting with QM9 using 100 data points vs. the entire dataset, we likely will not need to use the entire dataset to get reasonable estimates, but at the same time, how we choose this subset may be very important. I think there are two considerations when picking out a smaller subset to regress:
To satisfy these two requirements I think we can do the following.
Repeat this for each batch, generating a more manageable array for fitting that ensures we are reasonably sampling the dataset (a rough sketch follows below). Rather than sorting a list, we could also generate a histogram and then do the sampling based upon standard deviations from the mean, but that is probably not necessary.
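A rough sketch of the sort-and-stride subsampling idea (a histogram-based scheme would be similar); the array names are hypothetical:

```python
import numpy as np

def subsample_by_energy(energies, counts, n_keep):
    """Sort molecules by total energy and take an evenly strided subset so the
    subsample spans the full energy range instead of clustering in one tail."""
    order = np.argsort(energies)
    stride = max(1, len(energies) // n_keep)
    keep = order[::stride][:n_keep]
    return energies[keep], counts[keep]

# energies: (N,) total energies; counts: (N, n_elements) atom counts (hypothetical arrays)
# sub_e, sub_c = subsample_by_energy(energies, counts, n_keep=10_000)
# self_energies, *_ = np.linalg.lstsq(sub_c, sub_e, rcond=None)
```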
Back to the point as to whether the values are "right", these are the ANI2x regressed values: https://github.com/isayev/ASE_ANI/blob/master/ani_models/ani-2x_8x/sae_linfit.dat To directly compare, I converted them to kJ/mol and truncated to 2 decimal places, appending to the table I shared above. ANI2x and QM9 use the same level of theory, so this comparison makes sense.
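For reference, 1 Hartree ≈ 2625.4996 kJ/mol. A small parsing sketch is below, assuming the sae_linfit.dat lines have the form `Element,index=value` with values in Hartree (worth verifying against the actual file before relying on this):

```python
HARTREE_TO_KJ_PER_MOL = 2625.4996

def read_sae(path):
    """Read a sae_linfit.dat-style file and return {element: self energy in kJ/mol},
    assuming 'Element,index=value' lines with values in Hartree."""
    sae = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or "=" not in line:
                continue
            label, value = line.split("=", 1)
            element = label.split(",")[0].strip()
            sae[element] = round(float(value) * HARTREE_TO_KJ_PER_MOL, 2)
    return sae

# print(read_sae("sae_linfit.dat"))
```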
Condensed to just the energies:
I'd say overall the results are consistent across the regressions of the datasets and, again, not too far off from the DFT values. This just might be the limit of accuracy of doing a regression. It might be good to actually do this in batches, doing the fitting for each batch and getting a mean value for the self energy (to be able to also assess variability in the fitting itself); see the sketch below.
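A sketch of the batched fitting idea, again with hypothetical array names:

```python
import numpy as np

def batched_self_energy_fit(energies, counts, n_batches=10, seed=0):
    """Fit self energies independently on random batches and report the
    per-element mean and standard deviation of the fitted coefficients."""
    rng = np.random.default_rng(seed)
    batches = np.array_split(rng.permutation(len(energies)), n_batches)
    fits = []
    for batch in batches:
        coef, *_ = np.linalg.lstsq(counts[batch], energies[batch], rcond=None)
        fits.append(coef)
    fits = np.array(fits)
    return fits.mean(axis=0), fits.std(axis=0)

# mean_sae, std_sae = batched_self_energy_fit(energies, counts)
```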
We need to double-check the source of the relatively significant difference in the atomic self-energies of the QM9 dataset.