Bug in PCQM4M's scaffold split #162

Yelrose · 2021-04-20T06:29:38Z

We are exploring the data distribution, using the following code to compute each molecule's scaffold. We find that some SMILES in the train set and the test set share the same scaffold.

from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def generate_scaffold(smiles, include_chirality=False):
    """Return the Murcko scaffold SMILES of the target molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # RDKit returns None when the SMILES string cannot be parsed.
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    return MurckoScaffold.MurckoScaffoldSmiles(
        mol=mol, includeChirality=include_chirality)
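The overlap check described above could look roughly like the following sketch. The helper name `scaffold_overlap` and the stand-in `toy_scaffold` are hypothetical and only there so the example runs without RDKit; in practice one would pass `generate_scaffold` as `scaffold_fn` and real SMILES lists.

```python
from collections import defaultdict


def scaffold_overlap(train_smiles, test_smiles, scaffold_fn):
    """Return {scaffold: (train SMILES, test SMILES)} for scaffolds in both sets."""
    def group(smiles_list):
        groups = defaultdict(list)
        for smi in smiles_list:
            groups[scaffold_fn(smi)].append(smi)
        return groups

    train_groups = group(train_smiles)
    test_groups = group(test_smiles)
    shared = train_groups.keys() & test_groups.keys()
    return {s: (train_groups[s], test_groups[s]) for s in shared}


# Hypothetical stand-in: toy strings of the form "scaffold|id", not real SMILES.
def toy_scaffold(smiles):
    return smiles.split("|")[0]


overlap = scaffold_overlap(["A|1", "B|2"], ["A|3", "C|4"], toy_scaffold)
print(overlap)  # {'A': (['A|1'], ['A|3'])}
```

A non-empty result means at least one scaffold appears on both sides of the split.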

weihua916 · 2021-04-20T22:55:18Z

Maintainer

Hi! Great catch. The provided split is indeed not a scaffold split, contrary to our original intention. Investigating our processing code, we found a bug that causes the split to be based on PubChem CID rather than on scaffold structure. The split might induce some systematic OOD scenarios because of how PubChem indexes compounds.
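To make the difference concrete, here is a minimal sketch contrasting a split based on identifier order with a split that keeps scaffold groups together. It is not the OGB processing code: the function names, the integer split sizes, the greedy largest-group-first assignment, and the toy data are all assumptions for illustration.

```python
from collections import defaultdict


def split_by_order(ids, n_train, n_valid):
    """Split by position after sorting on an identifier (e.g. a CID)."""
    order = sorted(range(len(ids)), key=lambda i: ids[i])
    return (order[:n_train],
            order[n_train:n_train + n_valid],
            order[n_train + n_valid:])


def split_by_scaffold(scaffolds, n_train, n_valid):
    """Greedily assign whole scaffold groups to train, then valid, then test."""
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)
    # Largest groups first; ties broken by first index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= n_train:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= n_train + n_valid:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


def shared_scaffolds(scaffolds, part_a, part_b):
    """Scaffolds that occur in both index lists."""
    return {scaffolds[i] for i in part_a} & {scaffolds[i] for i in part_b}


# Toy data: ten molecules with scaffold labels A, B, C and increasing ids.
scaffolds = ["A", "B", "A", "B", "A", "C", "A", "C", "B", "A"]
ids = list(range(10))

tr_o, va_o, te_o = split_by_order(ids, n_train=6, n_valid=2)
tr_s, va_s, te_s = split_by_scaffold(scaffolds, n_train=6, n_valid=2)

print(shared_scaffolds(scaffolds, tr_o, te_o))  # scaffolds A and B leak
print(shared_scaffolds(scaffolds, tr_s, te_s))  # set()
```

In the order-based split, the test molecules share scaffolds with the training set; in the grouped split, no scaffold crosses a partition boundary.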

We are internally running experiments on the correct scaffold split to see whether model performance shows a different trend. As our prediction task is still valid, we will most likely keep the split we currently have, to avoid confusion and test label leakage.

2 replies

weihua916 Apr 21, 2021
Maintainer

To give some intuition, below are comparisons between the PubChem ID split (the current split) and the correct scaffold split, in terms of prediction target values and input graph size.

weihua916 Apr 21, 2021
Maintainer

Here is a comparison of model performance on our current split and on the correct scaffold split. The baseline code is used as-is. Two observations:

1. The trend is consistent: GIN-virtual > GCN-virtual > GIN > GCN. In addition, we found that smaller GIN-virtual models perform worse, which is consistent with our observation in the paper.
2. In absolute numbers, the scaffold split is indeed harder.

Performance on the PubChemID split (currently used)

Model	Valid MAE	Test MAE*	#Parameters	Hardware
GIN	0.1536	0.1678	3.8M	GeForce RTX 2080 (11GB GPU)
GIN-virtual	0.1396	0.1487	6.7M	GeForce RTX 2080 (11GB GPU)
GCN	0.1684	0.1838	2.0M	GeForce RTX 2080 (11GB GPU)
GCN-virtual	0.1510	0.1579	4.9M	GeForce RTX 2080 (11GB GPU)
MLP+Fingerprint	0.2044	0.2068	16.1M	GeForce RTX 2080 (11GB GPU)

Performance on the correct scaffold split

Model	Valid MAE	Test MAE*	#Parameters	Hardware
GIN	0.1670	0.2254	3.8M	GeForce RTX 2080 (11GB GPU)
GIN-virtual	0.1513	0.1975	6.7M	GeForce RTX 2080 (11GB GPU)
GCN	0.1845	0.2499	2.0M	GeForce RTX 2080 (11GB GPU)
GCN-virtual	0.1611	0.2147	4.9M	GeForce RTX 2080 (11GB GPU)
MLP+Fingerprint	0.2446	0.3301	16.1M	GeForce RTX 2080 (11GB GPU)
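As a quick sanity check on the two tables above, the following sketch ranks the models by test MAE on each split and computes how much each test MAE grows on the scaffold split. The numbers are copied from the tables; the variable and function names are illustrative only.

```python
# Test MAE values copied from the two tables above.
cid_test = {"GIN": 0.1678, "GIN-virtual": 0.1487, "GCN": 0.1838,
            "GCN-virtual": 0.1579, "MLP+Fingerprint": 0.2068}
scaffold_test = {"GIN": 0.2254, "GIN-virtual": 0.1975, "GCN": 0.2499,
                 "GCN-virtual": 0.2147, "MLP+Fingerprint": 0.3301}


def ranking(mae_by_model):
    """Model names ordered from lowest (best) to highest MAE."""
    return sorted(mae_by_model, key=mae_by_model.get)


same_order = ranking(cid_test) == ranking(scaffold_test)
relative_increase = {m: scaffold_test[m] / cid_test[m] - 1 for m in cid_test}

print(ranking(cid_test))
print("same ranking on both splits:", same_order)
for model, inc in relative_increase.items():
    print(f"{model}: {inc:+.1%}")
```

The ranking is identical on both splits, while every model's test MAE is higher on the scaffold split.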

As the trend is quite consistent, we believe models developed for our particular split should transfer well to other split types; hence, it is meaningful to develop models on the current split. We may need to be somewhat pessimistic about the absolute performance numbers, keeping in mind that the scaffold split is harder.

One potential explanation for the much worse test performance on the scaffold split is its large target value shift (test values are significantly biased towards lower HOMO-LUMO gaps, as seen in the upper right of the figure above). Looking at the models' average predictions on the scaffold test set, the predicted values are larger than the ground truth on average (a simple post-hoc correction does not improve the performance):

Model (on scaffold split)	Avg (predicted) test values	Diff to ground-truth
Ground-truth	5.110	0
GIN	5.158	+0.047
GIN-virtual	5.148	+0.038
GCN	5.152	+0.042
GCN-virtual	5.171	+0.061
MLP+Fingerprint	5.281	+0.171

The diff to ground-truth is much smaller for the current CID split (<0.01), perhaps because the target value shift for the current CID split is much smaller.
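The "Diff to ground-truth" column can be recomputed from the averages in the table above, as in the sketch below. The thread does not say which post-hoc correction was tried; purely as an assumption, the sketch considers subtracting one constant offset from every prediction. Since |e - c| >= |e| - |c| for each error e, such a shift can lower the MAE by at most |c|.

```python
# Values copied from the tables above (scaffold split).
ground_truth_mean = 5.110
pred_means = {"GIN": 5.158, "GIN-virtual": 5.148, "GCN": 5.152,
              "GCN-virtual": 5.171, "MLP+Fingerprint": 5.281}
scaffold_test_mae = {"GIN": 0.2254, "GIN-virtual": 0.1975, "GCN": 0.2499,
                     "GCN-virtual": 0.2147, "MLP+Fingerprint": 0.3301}

# Mean signed error per model; positive means over-prediction on average.
# Recomputed from rounded table entries, so the last digit can differ from
# the reported column (GIN gives 0.048 here, +0.047 in the table).
bias = {m: round(v - ground_truth_mean, 3) for m, v in pred_means.items()}

# Assumption: the correction subtracts the constant offset bias[m].
# A constant shift by c cannot reduce the MAE by more than |c|.
mae_lower_bound = {m: round(scaffold_test_mae[m] - abs(bias[m]), 4)
                   for m in bias}

for m in bias:
    print(f"{m}: bias {bias[m]:+.3f}, test MAE {scaffold_test_mae[m]:.4f}, "
          f"lower bound after constant shift {mae_lower_bound[m]:.4f}")
```

The bound is only a best case; the actual MAE after a shift depends on the full error distribution, which is not available here.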
