Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

alchemlyb tutorial in google colab #196

Closed
andresilvapimentel opened this issue Jun 7, 2022 · 29 comments
Closed

alchemlyb tutorial in google colab #196

andresilvapimentel opened this issue Jun 7, 2022 · 29 comments

Comments

@andresilvapimentel
Copy link

It would be nice if the developers write an alchemlyb tutorial in google colab using a simple ligand/protein complex example.

@andresilvapimentel
Copy link
Author

I was trying to build a notebook using my multiple xvg files, however I did not get to convert the multiple xvg files (from each lambda) to a dataset similar to benzene_load
How can I perform this task?

@andresilvapimentel
Copy link
Author

In my notebook, It works fine if I use the benzene_load dataset, but I did not get to convert my personal multiple xvg files (from each lambda) into a dataset to work with my notebook.

@xiki-tempula
Copy link
Collaborator

@andresilvapimentel Hi, thank you for your interest in alchemlyb
we have an experimental workflow that should do the calculation automatically and it is in the branch of https://github.com/xiki-tempula/alchemlyb/tree/workf
With an API like

    >>> import os
    >>> from alchemtest.gmx import load_ABFE
    >>> from alchemlyb.workflows import ABFE
    >>> # Obtain the path of the data
    >>> dir = os.path.dirname(load_ABFE()['data']['complex'][0])
    >>> print(dir)
    'alchemtest/gmx/ABFE/complex'
    >>> workflow = ABFE(units='kcal/mol', software='Gromacs', dir=dir,
    >>>                 prefix='dhdl', suffix='xvg', T=298, out='./')
    >>> workflow.run(skiptime=10, uncorr='dhdl', threshold=50,
    >>>              methods=('mbar', 'bar', 'ti'), overlap='O_MBAR.pdf',
    >>>              breakdown=True, forwrev=10)

Do you mind have a try and give me some feedback?

@andresilvapimentel
Copy link
Author

I got this error message:
ModuleNotFoundError Traceback (most recent call last)
in ()
1 import os
2 from alchemtest.gmx import load_ABFE
----> 3 from alchemlyb.workflows import ABFE
4 # Obtain the path of the data
5 dir = os.path.dirname(load_ABFE()['data']['complex'][0])

ModuleNotFoundError: No module named 'alchemlyb.workflows'

@andresilvapimentel
Copy link
Author

I am not sure if you understood what I would like to do it.
I just want to convert my personal multiple xvg files (from each lambda) into a dataset (like load_benzene) to work with my notebook.
How can I perform this task?

@dotsdl
Copy link
Member

dotsdl commented Jun 7, 2022

Hi @andresilvapimentel, can you share the block of code you are running to load your XVG files, along with the exception/error you get? It's impossible for us to tell what the issue may be without this information.

@andresilvapimentel
Copy link
Author

HI @dotsdl , You asked exactly what I am asking for. I do not know how to load my xvg files. I the alchemlyb tutorial, the developers used:

from alchemtest.gmx import load_benzene
from alchemlyb.parsing.gmx import extract_dHdl
dataset = load_benzene()
dataset

My question is how to change the line:
dataset = load_benzene()

to upload my personal xvg files. So, my question is: How to convert my personal xvg files to a dataset?

@dotsdl
Copy link
Member

dotsdl commented Jun 8, 2022

Ah, now I see what you mean. I'll be the first to admit that we could use some clear examples in the documentation, but I believe you are referring to the example given in this section of the docs:

from alchemtest.gmx import load_benzene
from alchemlyb.parsing.gmx import extract_dHdl

dataset = load_benzene()

dhdl = extract_dHdl(dataset['data']['Coulomb'][0], 310)

dhdl.attrs['temperature']
dhdl.attrs['energy_unit']

If you run the dataset = load_benzene() line and have a look at the dataset object, you'll see that it's basically a dictionary, and dataset['data'] is a dictionary with two keys, 'Coulomb', and 'VDW'. The values for those keys are a list of paths to the example XVG files for benzene for each lambda window.

In the extract_dHdl line we parse the first such XVG file by passing its path in. So, to use your own XVG file, you need only pass your XVG file path to extract_dHdl, along with the temperature at which your simulation was sampled.

Does this help?

@andresilvapimentel
Copy link
Author

Yes. It helped. Thanks.

I upload the xvg files with:
from google.colab import files
uploaded = files.upload()

Then, I alocated uploaded as dataset:

dataset = {'data':{'Coulomb':[],'VDW':[]}}
for (k,v) in uploaded.items():
dataset['data']['Coulomb'].append(k)
dataset['data']['VDW'].append(k)

Now, the issue is in the df line:

data_list = [extract_u_nk(xvg, T=310) for xvg in dataset['Coulomb']]
df = forward_backward_convergence(data_list, 'mbar')
ax = plot_convergence(df)
ax.figure.savefig('dF_t.pdf')

The error message is:
ParameterError Traceback (most recent call last)
in ()
5 #bz = dataset['data']
6 data_list = [extract_u_nk(xvg, T=310) for xvg in dataset['Coulomb']]
----> 7 df = forward_backward_convergence(data_list, 'mbar')
8 ax = plot_convergence(df)
9 ax.figure.savefig('dF_t.pdf')

358 raise ParameterError(
359 'Warning: Should have \sum_n W_nk = 1. Actual column sum for state %d was %f. %d other columns have similar problems' %
--> 360 (firstbad, column_sums[firstbad], np.sum(badcolumns)))
361
362 row_sums = np.sum(W * N_k, axis=1)

ParameterError: Warning: Should have \sum_n W_nk = 1. Actual column sum for state 0 was 18.391322. 41 other columns have similar problems

This error also shows up in the following lines:

mbar_coul.fit(u_nk_coul)

@andresilvapimentel
Copy link
Author

The other error message shows up when I run the command block:

dHdl_coul = alchemlyb.concat([extract_dHdl(xvg, T=310) for xvg in dataset['Coulomb']])
ti_coul = TI().fit(dHdl_coul)
dHdl_vdw = alchemlyb.concat([extract_dHdl(xvg, T=310) for xvg in dataset['VDW']])
ti_vdw = TI().fit(dHdl_vdw)

from alchemlyb.visualisation import plot_ti_dhdl
ax = plot_ti_dhdl([ti_coul, ti_vdw], labels=['Coul', 'VDW'], colors=['r', 'g'])
ax.figure.savefig('dhdl_TI.pdf')

The error message is:

ValueError Traceback (most recent call last)
in ()
10
11 from alchemlyb.visualisation import plot_ti_dhdl
---> 12 ax = plot_ti_dhdl([ti_coul, ti_vdw], labels=['Coul', 'VDW'], colors=['r', 'g'])
13 ax.figure.savefig('dhdl_TI.pdf')

/usr/local/lib/python3.7/dist-packages/alchemlyb/visualisation/ti_dhdl.py in plot_ti_dhdl(dhdl_data, labels, colors, units, ax)
95 raise ValueError(
96 'Length of labels ({}) should be the same as the number of data ({})'.format(
---> 97 len(labels), len(dhdl_list)))
98
99 if colors is None:

ValueError: Length of labels (2) should be the same as the number of data (4)

Do you know what is going on?

@dotsdl
Copy link
Member

dotsdl commented Jun 10, 2022

From what you shared, it looks like:

dataset = {'data':{'Coulomb':[],'VDW':[]}}
for (k,v) in uploaded.items():
dataset['data']['Coulomb'].append(k)
dataset['data']['VDW'].append(k)

data_list = [extract_u_nk(xvg, T=310) for xvg in dataset['Coulomb']]

should have instead for the last line:

data_list = [extract_u_nk(xvg, T=310) for xvg in dataset['data']['Coulomb']]

Does this fix your issue?

@andresilvapimentel
Copy link
Author

Hi @dotsdl Thank you for your suggestion.

No. I did not fix it. It is even worse because it gives the error message in the previous line:

KeyError Traceback (most recent call last)
in ()
3 from alchemlyb.convergence import forward_backward_convergence
4
----> 5 data_list = [extract_u_nk(xvg, T=310) for xvg in dataset['data']['Coulomb']]
6 df = forward_backward_convergence(data_list, 'mbar')
7 ax = plot_convergence(df)

KeyError: 'data'

Yesterday, I found a couple additional issues with this uploading process. I will try a couple more things before sharing with you.

@xiki-tempula
Copy link
Collaborator

Yes. It helped. Thanks.

I upload the xvg files with: from google.colab import files uploaded = files.upload()

Then, I alocated uploaded as dataset:

dataset = {'data':{'Coulomb':[],'VDW':[]}} for (k,v) in uploaded.items(): dataset['data']['Coulomb'].append(k) dataset['data']['VDW'].append(k)

Now, the issue is in the df line:

data_list = [extract_u_nk(xvg, T=310) for xvg in dataset['Coulomb']] df = forward_backward_convergence(data_list, 'mbar') ax = plot_convergence(df) ax.figure.savefig('dF_t.pdf')

The error message is: ParameterError Traceback (most recent call last) in () 5 #bz = dataset['data'] 6 data_list = [extract_u_nk(xvg, T=310) for xvg in dataset['Coulomb']] ----> 7 df = forward_backward_convergence(data_list, 'mbar') 8 ax = plot_convergence(df) 9 ax.figure.savefig('dF_t.pdf')

358 raise ParameterError( 359 'Warning: Should have \sum_n W_nk = 1. Actual column sum for state %d was %f. %d other columns have similar problems' % --> 360 (firstbad, column_sums[firstbad], np.sum(badcolumns))) 361 362 row_sums = np.sum(W * N_k, axis=1)

ParameterError: Warning: Should have \sum_n W_nk = 1. Actual column sum for state 0 was 18.391322. 41 other columns have similar problems

This error also shows up in the following lines:

mbar_coul.fit(u_nk_coul)

I think this could be solved by using autoMBAR instead of MBAR from alchemlyb.estimators import AutoMBAR as MBAR

@xiki-tempula
Copy link
Collaborator

The other error message shows up when I run the command block:

dHdl_coul = alchemlyb.concat([extract_dHdl(xvg, T=310) for xvg in dataset['Coulomb']]) ti_coul = TI().fit(dHdl_coul) dHdl_vdw = alchemlyb.concat([extract_dHdl(xvg, T=310) for xvg in dataset['VDW']]) ti_vdw = TI().fit(dHdl_vdw)

from alchemlyb.visualisation import plot_ti_dhdl ax = plot_ti_dhdl([ti_coul, ti_vdw], labels=['Coul', 'VDW'], colors=['r', 'g']) ax.figure.savefig('dhdl_TI.pdf')

The error message is:

ValueError Traceback (most recent call last) in () 10 11 from alchemlyb.visualisation import plot_ti_dhdl ---> 12 ax = plot_ti_dhdl([ti_coul, ti_vdw], labels=['Coul', 'VDW'], colors=['r', 'g']) 13 ax.figure.savefig('dhdl_TI.pdf')

/usr/local/lib/python3.7/dist-packages/alchemlyb/visualisation/ti_dhdl.py in plot_ti_dhdl(dhdl_data, labels, colors, units, ax) 95 raise ValueError( 96 'Length of labels ({}) should be the same as the number of data ({})'.format( ---> 97 len(labels), len(dhdl_list))) 98 99 if colors is None:

ValueError: Length of labels (2) should be the same as the number of data (4)

Do you know what is going on?

I think this is a tricky issue, the plot_ti_dhdl assumes a certain form of input data, I might need your file to see what is the best way forward.

@andresilvapimentel
Copy link
Author

Hi @xiki-tempula Thank you for your comment.
I also think it is a trick issue. It seems it is related to the indexes of my dataset (and also the shapes of dhdl, dhdl_coul, dhdl_vdw, u_nk_coul and u_nk_vdw).

I also found that the data_list generated in my notebook is in a format different for what is required for the code works well.
The command line:
df = forward_backward_convergence(data_list, 'mbar')
also gives the same convergence error message. So, the use of from alchemlyb.estimators import AutoMBAR as MBAR is not going to help.

I would like to suggest something. My recomendation is to work in a module to read the xvg files and generate a dataset with
the right for of input data to avoid this issue.

I will work a little bit more. If I do not solve the issue, I will send my notebook and dataset to get further help. Which file do you need? The xvg files are big.

@xiki-tempula
Copy link
Collaborator

xiki-tempula commented Jun 10, 2022

@andresilvapimentel So the alchemlyb is intended to function as a library instead of offering an end-to-end solution, so technically, all the components to do a convergence analysis are available. We can obviously giving advice on how to use them.

@andresilvapimentel
Copy link
Author

I tried to use from alchemlyb.estimators import AutoMBAR as MBAR but It did not work as well giving the same error message about convergence.

@xiki-tempula
Copy link
Collaborator

@andresilvapimentel If the AutoMBAR cannot provide a solution, I'm afraid that it is out of the scoop of this repo. Our AutoMBAR is just a wrapper which tries different MBAR solvers to get it to work. In most cases, the AutoMBAR can find the right solver to solve the MBAR but ultimately, it is the pymbar that is doing the solving.
Maybe @dotsdl could consult with the pymbar people to see what is the best way forward.

@andresilvapimentel
Copy link
Author

@xiki-tempula How can I send the xvg files in order to help me with solution? What is the best way? Maybe, I am generating the xvg files different than you. Do you aggree with me?

@xiki-tempula
Copy link
Collaborator

@andresilvapimentel It is proprietary? If it is not, you could just put it in the comment section.
Like this
untitled folder.zip

@andresilvapimentel
Copy link
Author

@xiki-tempula It is my own xvg files:
dhdl.zip

I would be very thankful if you can solve the issue.

@xiki-tempula
Copy link
Collaborator

@andresilvapimentel Ok, I guess you need the devel version, follow the guide in https://alchemlyb.readthedocs.io/en/latest/install.html#installing-from-source

I could get the result with the devel version by

import glob
from alchemlyb.parsing.gmx import extract_u_nk
from alchemlyb.convergence import forward_backward_convergence
u_nk_list = [extract_u_nk(file, 300) for file in glob.glob('*.xvg')]
df = forward_backward_convergence(u_nk_list, 'autombar')

Note that your case seems to be very difficult to solve, so the calculation will take a long time.

@andresilvapimentel
Copy link
Author

The u_nk_list used by you has the same form of the data_list = [extract_u_nk(xvg, T=310) for xvg in dataset['Coulomb']] that I used, so it gave the same error message using the mbar.

I did not use the autombar because I need to install it. I did not try yet.

@andresilvapimentel
Copy link
Author

It did not converge after 1 h when the devel version was intalled.

WARNING: Did not converge to within specified tolerance.
max_delta = 6.668503e-09, tol = 1.000000e-12, maximum_iterations = 10000, iterations completed = 9999

@andresilvapimentel
Copy link
Author

As I did not solve the issue, I run the tutorial of the mdpow program using the benzene example to get the FEP data (from the xvg files). Then, I uploaded these xvg files into the alchemlyp notebook to try running this benzene example there.
I got the following errors:

  1. with the command df = forward_backward_convergence(data_list, 'mbar')

LinAlgError Traceback (most recent call last)
in ()
----> 1 df = forward_backward_convergence(data_list, 'mbar')

<array_function internals> in eigh(*args, **kwargs)

/usr/local/lib/python3.7/dist-packages/numpy/linalg/linalg.py in _raise_linalgerror_eigenvalues_nonconvergence(err, flag)
92
93 def _raise_linalgerror_eigenvalues_nonconvergence(err, flag):
---> 94 raise LinAlgError("Eigenvalues did not converge")
95
96 def _raise_linalgerror_svd_nonconvergence(err, flag):

LinAlgError: Eigenvalues did not converge

  1. with the command mbar_coul.fit(u_nk_coul)

LinAlgError Traceback (most recent call last)
in ()
1 mbar_coul = MBAR()
----> 2 mbar_coul.fit(u_nk_coul)
/usr/local/lib/python3.7/dist-packages/numpy/linalg/linalg.py in _raise_linalgerror_eigenvalues_nonconvergence(err, flag)
92
93 def _raise_linalgerror_eigenvalues_nonconvergence(err, flag):
---> 94 raise LinAlgError("Eigenvalues did not converge")
95
96 def _raise_linalgerror_svd_nonconvergence(err, flag):

LinAlgError: Eigenvalues did not converge

  1. with the command mbar_coul = MBAR().fit(u_nk_coul)

Warning: BAR is likely to be inaccurate because of poor overlap. Improve the sampling, or decrease the spacing betweeen states. For now, guessing that the free energy difference is 0 with no uncertainty.

LinAlgError Traceback (most recent call last)
in ()
3 bar_coul = BAR().fit(u_nk_coul)
4 bar_vdw = BAR().fit(u_nk_vdw)
----> 5 mbar_coul = MBAR().fit(u_nk_coul)
<array_function internals> in eigh(*args, **kwargs)

/usr/local/lib/python3.7/dist-packages/numpy/linalg/linalg.py in _raise_linalgerror_eigenvalues_nonconvergence(err, flag)
92
93 def _raise_linalgerror_eigenvalues_nonconvergence(err, flag):
---> 94 raise LinAlgError("Eigenvalues did not converge")
95
96 def _raise_linalgerror_svd_nonconvergence(err, flag):

LinAlgError: Eigenvalues did not converge

It is weird this warning message (Warning: BAR is likely to be inaccurate because of poor overlap. Improve the sampling, or decrease the spacing betweeen states. For now, guessing that the free energy difference is 0 with no uncertainty.) because the tutorial should not have this issue.

Can you help me, please? I would appreciate if you can help me because I do not have any idea how to solve this issue.
Thanks.

@orbeckst
Copy link
Member

The benzene example for MDPOW is almost certainly not converged, the simulations are too short.

I would suggest you start out with a simple system where you know that you can get converged data. This can be benzene but you should run it for long enough (probably a few ns per window — but ultimately you should do some convergence analysis). Or use the GROMACS FEP tutorial. In any case, you should have a test case that you trust.

Manually calculating the free energy difference over the whole data set as you did in (2) and (3) is a good start. Try different estimators. Adjust convergence parameters. Then do the same manually including only half the data. See if anything breaks in between.

Ultimately, alchemlyb is not a plug-and-play solution. It provides building blocks and you need to develop an understanding of what the building blocks can do. I suggest reading recent papers on FEP calculations (Justin Lemkuhl's GROMACS tutorial (1) and the best practices paper (2).

  1. J. Lemkul. From proteins to perturbed Hamiltonians: A suite of tutorials for the Gromacs-2018 molecular simulation package [article v1.0]. Living Journal of Computational Molecular Science, 1(1):5068–, 10 2018. DOI: 10.33011/livecoms.1.1.5068
  2. Antonia S J S Mey, Bryce K Allen, Hannah E Bruce Macdonald, John D Chodera, David F Hahn, Maximilian Kuhn, Julien Michel, David L Mobley, Levi N Naden, Samarjeet Prasad, Andrea Rizzi, Jenke Scheen, Michael R Shirts, Gary Tresadern, Huafeng Xu. Best Practices for Alchemical Free Energy Calculations [Article v1.0].. Living J Comput Mol Sci 2 (1) 2020, . [PubMed:34458687] [DOI]

@mrshirts
Copy link

mrshirts commented Jun 21, 2022

@orbeckst @xiki-tempula , I've started working on moving testing different solutions within MBAR at choderalab/pymbar#442. Would be good to get any comments on what you would suggest.

@orbeckst
Copy link
Member

As a note: Google Colab is Python version 3.7.13 (default, Apr 24 2022, 01:04:09) (as of this writing). alchemlyb will likely drop Python 3.7 support soon (in line with NEP29, see e.g. PR #214 ) to keep up with dependencies such as Pandas (supports 3.8 – 3.10 in their latest release).

Therefore, unless Google updates Colab, it seems pointless to create tutorials in Colab if we cannot use the latest version of alchemlyb.

@orbeckst
Copy link
Member

orbeckst commented Nov 7, 2022

Minimum requirement for alchemlyb is 3.8 so it won't run in Colab until Google changes the version of Python.

Please re-open once Colab has been modernized.

@orbeckst orbeckst closed this as not planned Won't fix, can't repro, duplicate, stale Nov 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants