Problem about preprocessing data for training #133

yangnianzu0515 · 2024-12-28T15:19:31Z

Hello, thank you for your great work.

Before you fully released the training data, I studied your prediction code and noticed that when using the pretrained Boltz model for prediction, if there are no precomputed MSA results locally, it will call the MSA server specified by msa_server_url. When there are multiple protein chains in a single prediction, it calls the run_mmseqs2 function with use_pairing set to True.
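A minimal sketch of the behavior described above, written as a standalone function. The name plan_msa_queries and its arguments are hypothetical and are not part of the Boltz codebase; the only rule taken from the description is that pairing is requested when more than one protein chain needs an MSA from the server.

```python
def plan_msa_queries(chains: dict[str, str], local_msas: set[str]) -> dict:
    """Decide which chains need a server query and whether to request pairing.

    chains: chain id -> protein sequence (hypothetical input layout).
    local_msas: ids of chains that already have a precomputed MSA on disk.
    """
    # Chains without a local MSA must be sent to the MSA server.
    to_query = sorted(cid for cid in chains if cid not in local_msas)
    return {
        "to_query": to_query,
        # Assumption from the description: pairing is only meaningful
        # when several protein chains are queried together.
        "use_pairing": len(to_query) > 1,
    }


plan = plan_msa_queries({"A": "MKV", "B": "MRV"}, local_msas=set())
print(plan)
```

With a single chain, or when every chain already has a local MSA, use_pairing comes out False in this sketch.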

I'd like to ask about the results of run_mmseqs2 with use_pairing=True. It seems that the results for each chain are paired with those of the other chains when they share the same key (see the key definition at line 215 of boltz/src/boltz/main.py, commit 9d88b09):

keys = [idx for idx, s in enumerate(paired) if s != "-" * len(s)]

These paired results are then combined with the unpaired results, and for each entity the combined results are written to a separate CSV file. When the CSV results are parsed, there is a deduplication step, so each sequence in a chain's MSA is either paired or unpaired, with no overlap between the two: a sequence from the unpaired results is removed if it already appears in the paired results. From the paper, it seems that the taxonomy information is used to pair MSA results. Is its purpose to obtain paired MSAs?
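To make the pairing and deduplication described above concrete, here is a minimal sketch. Only the list comprehension in paired_keys is taken from the quoted line of main.py; the function names, the use of None as the key for unpaired sequences, and the toy sequences are assumptions for illustration and may not match the actual Boltz implementation.

```python
from typing import Optional


def paired_keys(paired: list[str]) -> list[int]:
    """Indices of chains whose entry in this paired row is not gap-only."""
    return [idx for idx, s in enumerate(paired) if s != "-" * len(s)]


def merge_chain_msa(
    paired_seqs: list[tuple[int, str]],
    unpaired_seqs: list[str],
) -> list[tuple[Optional[int], str]]:
    """Combine paired and unpaired hits for one chain without duplicates.

    Paired hits keep their pairing key. Unpaired hits get key None
    (an assumption) and are dropped if the same sequence was already
    seen among the paired hits or earlier unpaired hits.
    """
    seen: set[str] = set()
    rows: list[tuple[Optional[int], str]] = []
    for key, seq in paired_seqs:
        if seq not in seen:
            seen.add(seq)
            rows.append((key, seq))
    for seq in unpaired_seqs:
        if seq not in seen:
            seen.add(seq)
            rows.append((None, seq))
    return rows


# Toy paired row for three chains; the second chain has only gaps.
print(paired_keys(["MKV", "---", "AC"]))

# "MKV" appears in both sets, so only its paired copy is kept.
print(merge_chain_msa([(0, "MKV"), (1, "MRV")], ["MKV", "MAV"]))
```

In this sketch every sequence ends up with exactly one label, paired (integer key) or unpaired (None), which is the "no overlap" property described in the question.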

Before you release the raw data processing pipeline, I thought I could use the prediction code's data processing to obtain MSAs for training. To speed up MSA retrieval, I set up a local ColabFold MSA server and replaced the API URL. I have already obtained some MSAs through the prediction pipeline. Does this mean I no longer need steps 5 and 6 from your training.md, since I have already obtained paired results using run_mmseqs2 with use_pairing=True?

Thanks in advance!
