Should Subsampling be Recommended? #156

Open
fjclark opened this issue Jan 10, 2025 · 1 comment
Comments

@fjclark
Collaborator

fjclark commented Jan 10, 2025

v1.0 states that "Most estimators require an uncorrelated set of samples from the equilibrium distribution to produce (relatively) unbiased estimates of the free energy difference and its statistical uncertainty." A discussion of subsampling is then given and the checklist states: "Make sure you subsample the data in your free energy estimation protocol".

My questions are:

  1. Is "Most estimators require an uncorrelated set of samples from the equilibrium distribution to produce (relatively) unbiased estimates of the free energy difference" true?

  2. Is "Most estimators require an uncorrelated set of samples from the equilibrium distribution to produce (relatively) unbiased estimates of statistical uncertainty" true?

  3. Should subsampling be recommended?

I've already started a discussion of this in choderalab/pymbar#545, but wanted to raise it here as I found this confusing when I first read the article. My understanding, given in more detail in the PyMBAR issue, is:

  1. This is not generally true. For example, when discussing bridge sampling, Gelman and Meng, 1998 state "the answer [the optimal weighting function] is easily obtained is when we have independent draws from both $p_o$ and $p_1$; although this assumption is typically violated in practice, it permits useful theoretical explorations and in fact the optimal estimator obtained under this assumption performs rather well in general" (i.e. even when the samples are correlated).

  2. This is true by definition when using estimates derived for uncorrelated samples, such as Equation 4.2 of Kong et al., 2003 for MBAR, but a better approach might be to use an uncertainty estimate which allows for correlation, such as the asymptotic estimates from Li et al., 2023, or block bootstrapping such as in Tan, Gallicchio et al., 2012. Alternatively, to keep the simple and fast uncorrelated data asymptotic estimate, could the mean be estimated from the unsubsampled data, while the uncertainty is estimated from the subsampled data (and a slight increase in the uncertainty of the uncertainty tolerated)?

  3. Subsampling increases the variance of the mean estimate and the variance of the variance estimate, and isn't helpful unless the cost of storing/using samples is non-negligible (Geyer, 1992, Section 3.6), i.e. correlated samples contain additional information (just less than uncorrelated samples) and discarding them is a waste of information.
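
As a rough illustration of point 3, here is a minimal sketch using a toy AR(1) time series rather than a real free energy estimator. It compares the spread of the sample mean over many independent chains when all samples are used with the spread when only every 20th sample is kept. The process, its parameters (`phi`, `stride`, chain length) and the function names are assumptions chosen for illustration only; they do not come from the article or from PyMBAR.

```python
import random
import statistics


def ar1_chain(n, phi, rng):
    """Generate n samples from a stationary AR(1) process with unit innovation variance."""
    # Start from the stationary distribution so no equilibration is needed.
    x = rng.gauss(0.0, 1.0) / (1.0 - phi**2) ** 0.5
    chain = []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        chain.append(x)
    return chain


def compare_mean_estimators(n_reps=400, n=1000, phi=0.9, stride=20, seed=0):
    """Return the variance across replicates of the full-data and subsampled means."""
    rng = random.Random(seed)
    full_means, thinned_means = [], []
    for _ in range(n_reps):
        chain = ar1_chain(n, phi, rng)
        full_means.append(statistics.fmean(chain))            # all correlated samples
        thinned_means.append(statistics.fmean(chain[::stride]))  # every stride-th sample
    return statistics.variance(full_means), statistics.variance(thinned_means)


var_full, var_thinned = compare_mean_estimators()
print(f"variance of mean, all samples:       {var_full:.4f}")
print(f"variance of mean, every 20th sample: {var_thinned:.4f}")
```

In this toy setting both estimators target the same mean, but the subsampled one should show the larger spread, which is the behaviour described in Geyer, 1992. Whether the effect is of similar size for MBAR or other free energy estimators is not something this sketch addresses.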

Maybe @mrshirts @jchodera @ppxasjsm @egallicc can comment? Even if I'm misunderstanding, it would be great to add some more references to clarify things.

Thanks.

@ppxasjsm
Contributor

Hi @fjclark,

I haven't thought about this in a while, so I may be misremembering. My understanding is that, at least for WHAM and MBAR, uncorrelated samples are needed for the mean and the uncertainty to converge asymptotically. I haven't looked at other estimators in detail.

> This is true by definition when using estimates derived for uncorrelated samples, such as Equation 4.2 of Kong et al., 2003 for MBAR, but a better approach might be to use an uncertainty estimate which allows for correlation, such as the asymptotic estimates from Li et al., 2023, or block bootstrapping such as in Tan, Gallicchio et al., 2012. Alternatively, to keep the simple and fast uncorrelated data asymptotic estimate, could the mean be estimated from the unsubsampled data, while the uncertainty is estimated from the subsampled data (and a slight increase in the uncertainty of the uncertainty tolerated)?

From practical experience, I have usually opted for no subsampling for the mean and subsampling for the uncertainty. At some point I compared subsampled and non-subsampled means from real simulation data, and when there are too few samples, subsampling can make things worse. Maybe that just means we should sample more? In that case I would typically favour running different replicas over relying on estimated uncertainties.
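
The "all data for the mean, subsampled data for the uncertainty" approach can be sketched as follows, again on a toy AR(1) series rather than real simulation output. The helper `mean_and_uncertainty` is hypothetical, and the stride is taken from the analytically known statistical inefficiency of the AR(1) process; in practice it would have to be estimated from the data (for example with the tools in PyMBAR's `timeseries` module).

```python
import math
import random
import statistics


def ar1_chain(n, phi, rng):
    """Generate n samples from a stationary AR(1) process with unit innovation variance."""
    x = rng.gauss(0.0, 1.0) / (1.0 - phi**2) ** 0.5
    chain = []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        chain.append(x)
    return chain


def mean_and_uncertainty(samples, stride):
    """Mean from all samples; standard error from every stride-th sample."""
    mean = statistics.fmean(samples)
    sub = samples[::stride]
    sem = statistics.stdev(sub) / math.sqrt(len(sub))
    return mean, sem


rng = random.Random(1)
phi = 0.9
samples = ar1_chain(20000, phi, rng)

# Statistical inefficiency of AR(1), known analytically here (an assumption of this toy).
stride = round((1 + phi) / (1 - phi))

mean, sem_sub = mean_and_uncertainty(samples, stride)
# Naive standard error that treats all correlated samples as independent.
sem_naive = statistics.stdev(samples) / math.sqrt(len(samples))

print(f"mean (all samples):           {mean:.3f}")
print(f"SEM from subsampled data:     {sem_sub:.3f}")
print(f"naive SEM from all samples:   {sem_naive:.3f}")
```

The naive standard error computed from all correlated samples should come out several times smaller than the subsampled one, which is the underestimate that subsampling is meant to avoid, while the mean itself still uses every sample.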

I can go back through the references and look for more recent papers on this.
