Should Subsampling be Recommended? #156

fjclark · 2025-01-10T16:15:13Z

v1.0 states that "Most estimators require an uncorrelated set of samples from the equilibrium distribution to produce (relatively) unbiased estimates of the free energy difference and its statistical uncertainty." A discussion of subsampling is then given and the checklist states: "Make sure you subsample the data in your free energy estimation protocol".

My questions are:

1. Is "Most estimators require an uncorrelated set of samples from the equilibrium distribution to produce (relatively) unbiased estimates of the free energy difference" true?
2. Is "Most estimators require an uncorrelated set of samples from the equilibrium distribution to produce (relatively) unbiased estimates of statistical uncertainty" true?
3. Should subsampling be recommended?

I've already started a discussion of this in choderalab/pymbar#545, but wanted to raise it here as I found this confusing when I first read the article. My understanding, given in more detail in the PyMBAR issue, is:

1. For the free energy estimate, this is not generally true. For example, when discussing bridge sampling, Gelman and Meng, 1998 state "the answer [the optimal weighting function] is easily obtained is when we have independent draws from both $p_0$ and $p_1$; although this assumption is typically violated in practice, it permits useful theoretical explorations and in fact the optimal estimator obtained under this assumption performs rather well in general" (i.e. even when the samples are correlated).
2. For the uncertainty estimate, this is true by definition when using estimates derived for uncorrelated samples, such as Equation 4.2 of Kong et al., 2003 for MBAR. A better approach might be to use an uncertainty estimate which allows for correlation, such as the asymptotic estimates from Li et al., 2023, or block bootstrapping such as in Tan, Gallicchio et al., 2012. Alternatively, to keep the simple and fast asymptotic estimate for uncorrelated data, could the mean be estimated from the unsubsampled data, while the uncertainty is estimated from the subsampled data (tolerating a slight increase in the uncertainty of the uncertainty)?
3. Subsampling increases the variance of the mean estimate and the variance of the variance estimate, and isn't helpful unless the cost of storing/using samples is non-negligible (Geyer, 1992 (Section 3.6)). That is, correlated samples contain additional information (just less than uncorrelated samples), and discarding them wastes that information.
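To illustrate the last point, here is a minimal sketch (a toy model, not a free energy estimator) that compares the spread of the sample mean computed from a full correlated time series with the spread of the mean computed from a subsampled copy. It assumes an AR(1) process with coefficient `phi`, for which the statistical inefficiency is $(1 + \phi)/(1 - \phi)$; the function names and parameter values are made up for this example.

```python
import random
import statistics


def ar1_series(n, phi, rng):
    """Return n samples of a zero-mean AR(1) process x[t] = phi * x[t-1] + noise."""
    # Draw the first point from the stationary distribution (no equilibration needed).
    x = rng.gauss(0.0, 1.0 / (1.0 - phi**2) ** 0.5)
    out = []
    for _ in range(n):
        out.append(x)
        x = phi * x + rng.gauss(0.0, 1.0)
    return out


def compare_mean_variance(n=2000, phi=0.9, n_repeats=500, seed=0):
    """Variance of the mean over repeats: full series vs. subsampled series."""
    rng = random.Random(seed)
    # Stride set to the analytical statistical inefficiency of AR(1).
    stride = round((1 + phi) / (1 - phi))
    full_means, sub_means = [], []
    for _ in range(n_repeats):
        series = ar1_series(n, phi, rng)
        full_means.append(statistics.fmean(series))
        sub_means.append(statistics.fmean(series[::stride]))
    return statistics.variance(full_means), statistics.variance(sub_means)


var_full, var_sub = compare_mean_variance()
print(f"variance of mean, all samples: {var_full:.4f}")
print(f"variance of mean, subsampled:  {var_sub:.4f}")
```

In this toy setting the mean of the full series should scatter less across repeats than the mean of the subsampled series, consistent with the argument that discarding correlated samples throws away information. Whether the same holds quantitatively for MBAR or WHAM estimates is not something this sketch shows.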

Maybe @mrshirts @jchodera @ppxasjsm @egallicc can comment? Even if I'm misunderstanding, it would be great to add some more references to clarify things.

Thanks.

ppxasjsm · 2025-01-10T17:35:02Z

Hi @fjclark,

I haven't thought about this in a while and I may be misremembering. My understanding is that, at least for WHAM and MBAR, uncorrelated samples are needed for the mean and uncertainty to converge asymptotically. I haven't looked at other estimators in detail.

> This is true by definition when using estimates derived for uncorrelated samples, such as Equation 4.2 of Kong et al., 2003 for MBAR, but a better approach might be to use an uncertainty estimate which allows for correlation, such as the asymptotic estimates from Li et al., 2023, or block bootstrapping such as in Tan, Gallicchio et al., 2012. Alternatively, to keep the simple and fast uncorrelated data asymptotic estimate, could the mean be estimated from the unsubsampled data, while the uncertainty is estimated from the subsampled data (and a slight increase in the uncertainty of the uncertainty tolerated)?

From practical experience, I have usually opted for no subsampling for the mean and subsampling for the uncertainty. At some point I compared subsampled and non-subsampled means from real simulation data, and when there are too few samples, subsampling can make things worse. Maybe it just means we should sample more? In that case I would typically favour running different replicas over relying on estimated uncertainties.
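As a rough sketch of that mixed protocol, the example below takes the mean from every sample and the standard error from a subsampled copy. It uses the plain sample mean of a toy AR(1) series rather than a free energy estimator, and the helper name `mean_and_subsampled_error`, the coefficient `phi` and the fixed `stride` are all assumptions made for illustration.

```python
import random
import statistics


def mean_and_subsampled_error(samples, stride):
    """Mean from all samples; standard error from every `stride`-th sample."""
    mean = statistics.fmean(samples)
    subsampled = samples[::stride]
    # Standard error of the mean, treating the subsampled points as independent.
    std_err = statistics.stdev(subsampled) / len(subsampled) ** 0.5
    return mean, std_err


# Toy correlated data: zero-mean AR(1) process started from its stationary distribution.
rng = random.Random(1)
phi = 0.9
x = rng.gauss(0.0, 1.0 / (1.0 - phi**2) ** 0.5)
samples = []
for _ in range(5000):
    samples.append(x)
    x = phi * x + rng.gauss(0.0, 1.0)

stride = 19  # statistical inefficiency of AR(1) with phi = 0.9, (1 + phi) / (1 - phi)
mean, err = mean_and_subsampled_error(samples, stride)

# For comparison: the standard error obtained by wrongly treating all samples as independent.
naive_err = statistics.stdev(samples) / len(samples) ** 0.5
print(f"mean = {mean:.3f}, subsampled error = {err:.3f}, naive error = {naive_err:.3f}")
```

The naive error bar, which ignores correlation, should come out noticeably smaller than the subsampled one, while the mean itself still uses every sample. In practice the stride would have to be estimated from the data, which this sketch does not attempt.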

I can take a look at references again to look at more recent papers on this.
