Bug in PCQM4M's scaffold split #162
Yelrose started this conversation in PCQM4M-LSC
Replies: 1 comment 2 replies
-
Hi! Great catch. Indeed, the provided split is not a scaffold split, which differs from our original intention. On investigating our processing code, we found a bug that causes the split to be based on PubChem CID rather than on scaffold structure. Because of how PubChem indexes compounds, this split might still induce some systematic OOD scenarios. We are internally running experiments on the correct scaffold split to see whether model performance shows any different trend. Since our prediction task is still valid, we will most likely keep the current split, to avoid confusion and test label leakage.
-
We are exploring the data distribution, using the following code to extract the scaffolds. However, we find that some SMILES in the train and test sets share the same scaffold.
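
The snippet referred to in the post is not reproduced in this thread. As a rough sketch of the kind of check described, one could group SMILES by scaffold and intersect the scaffold sets of the two splits. This assumes RDKit's Bemis-Murcko scaffold (`MurckoScaffold.MurckoScaffoldSmiles`) as the scaffold definition, which the thread does not confirm. The helper names `murcko_scaffold`, `group_by_scaffold`, and `shared_scaffolds` are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set


def murcko_scaffold(smiles: str) -> str:
    """Return the Bemis-Murcko scaffold SMILES (assumption: RDKit is installed)."""
    # Imported lazily so the grouping logic below also works without RDKit
    # when a different scaffold function is supplied.
    from rdkit.Chem.Scaffolds import MurckoScaffold

    return MurckoScaffold.MurckoScaffoldSmiles(smiles=smiles, includeChirality=False)


def group_by_scaffold(
    smiles_list: Iterable[str],
    scaffold_fn: Callable[[str], str] = murcko_scaffold,
) -> Dict[str, List[str]]:
    """Map each scaffold to the list of SMILES that reduce to it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for smi in smiles_list:
        groups[scaffold_fn(smi)].append(smi)
    return dict(groups)


def shared_scaffolds(
    train_smiles: Iterable[str],
    test_smiles: Iterable[str],
    scaffold_fn: Callable[[str], str] = murcko_scaffold,
) -> Set[str]:
    """Return the scaffolds that occur in both the train and the test split."""
    train_groups = group_by_scaffold(train_smiles, scaffold_fn)
    test_groups = group_by_scaffold(test_smiles, scaffold_fn)
    return set(train_groups) & set(test_groups)
```

Under a true scaffold split, `shared_scaffolds(train_smiles, test_smiles)` would be expected to return an empty set. A non-empty result corresponds to the overlap reported above.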