Problem about preprocessing data for training #133

Open
yangnianzu0515 opened this issue Dec 28, 2024 · 0 comments
yangnianzu0515 commented Dec 28, 2024

Hello, thank you for your great work.

Before you fully released the training data, I studied your prediction code and noticed that when using the pretrained Boltz model for prediction, if there are no precomputed MSA results locally, it will call the MSA server specified by msa_server_url. When there are multiple protein chains in a single prediction, it calls the run_mmseqs2 function with use_pairing set to True.

I'd like to ask about the results of run_mmseqs2 with use_pairing=True. It seems that each chain's results are paired with those of the other chains when they share the same key (see the key definition, `keys = [idx for idx, s in enumerate(paired) if s != "-" * len(s)]`), and that these paired results are then combined with the unpaired results. For each entity, the results are written to a separate CSV file. Parsing the CSV includes a deduplication step, so each sequence in a chain's MSA is either paired or unpaired, with no overlap: an unpaired hit is removed if it already appears among the paired ones. From the paper, it seems that taxonomy information is used to pair MSA results. Is that step there to obtain paired MSAs?
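To make the question concrete, here is a minimal sketch of how I understand the merge and deduplication for a single chain. It is a toy model, not the actual Boltz code: the function name `merge_paired_unpaired`, the use of `-1` as the key for unpaired hits, and doing the deduplication in the same function (rather than later, when the CSV is parsed) are all my own assumptions for illustration.

```python
def merge_paired_unpaired(paired, unpaired):
    """Toy model of the per-chain merge (hypothetical helper, not Boltz's API).

    paired:   this chain's rows from the paired alignment; a row made only
              of gaps means the chain has no hit in that paired row.
    unpaired: this chain's hits from the unpaired search.
    Returns a list of (key, sequence) tuples.
    """
    # Same expression as in the prediction code: keep the row index of
    # every paired row in which this chain is not all gaps.
    keys = [idx for idx, s in enumerate(paired) if s != "-" * len(s)]

    rows = []
    seen = set()

    # Paired hits keep their row index as key, so rows with the same key
    # in different chains' files can be matched up again.
    for idx in keys:
        seq = paired[idx]
        if seq not in seen:
            seen.add(seq)
            rows.append((idx, seq))

    # Unpaired hits get key -1 (assumed sentinel) and are dropped if the
    # same sequence was already kept.
    for seq in unpaired:
        if seq not in seen:
            seen.add(seq)
            rows.append((-1, seq))

    return rows


if __name__ == "__main__":
    example = merge_paired_unpaired(["MKV", "---", "MKI"], ["MKV", "MRV"])
    for key, seq in example:
        print(key, seq)
```

In this sketch, the all-gap row at index 1 is skipped and the unpaired copy of `MKV` is removed because it already appears as a paired hit. If the real pipeline behaves differently, that difference is exactly what I'd like to understand.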

Before you release the raw data processing pipeline, I thought I could use the prediction code's data processing to obtain MSAs for training. To speed up MSA retrieval, I set up a local ColabFold MSA server and replaced the API URL. I've already obtained some MSAs through the prediction pipeline. Does this mean I no longer need steps 5 and 6 from your training.md, since I have already obtained paired results using run_mmseqs2 with use_pairing=True?

Thanks in advance!
