Better global alignment when aligning in other direction #78

Open
glennhickey opened this issue Oct 17, 2024 · 9 comments

@glennhickey
Contributor

@adamnovak has been picking through the HPRC graph and finding suspect alignments. Here is one from CHM13#0#chr3:164033777-164033842. If I align it with abpoa in its forward orientation (on chm13) I get
[image: region-2-abpoa]
(where gaps are transparent).
But if I reverse complement I get
[image: region-2-abpoa rev]
which seems much cleaner -- i.e. there is only 1 gap per row except in 3 cases, where the gap seems more properly placed on the right.

Are these alignments somehow scoring equivalently, even though by eye one seems much better? If not, is this expected or a bug? Do you have any suggestions on how it could be improved?
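One way to make the "by eye" comparison concrete is to count gap runs per row and compare a linear with an affine gap cost. The sketch below is only an illustration: the two toy MSAs, the helper names, and the penalty values are assumptions made up for this example, not the real alignments from this issue and not abpoa's scoring code.

```python
from itertools import groupby

def gap_runs(row: str, gap: str = "-") -> int:
    """Count maximal runs of consecutive gap characters in one MSA row."""
    return sum(1 for is_gap, _ in groupby(row, key=lambda c: c == gap) if is_gap)

def affine_gap_cost(msa, gap_open: int = 4, gap_extend: int = 2) -> int:
    """Per-row affine gap cost summed over the MSA (hypothetical penalties)."""
    return sum(gap_open * gap_runs(row) + gap_extend * row.count("-") for row in msa)

# Two made-up MSAs of the same three sequences (ACGTA, ACGTTTTA, ACGTTA).
# Both contain five gap characters and no mismatched columns.
msa_scattered = ["ACG--T-A", "ACGTTTTA", "ACG-TT-A"]  # two gap runs in rows 0 and 2
msa_compact = ["ACGT---A", "ACGTTTTA", "ACGTT--A"]    # one gap run in rows 0 and 2

for name, msa in (("scattered", msa_scattered), ("compact", msa_compact)):
    print(name,
          "gap runs per row:", [gap_runs(r) for r in msa],
          "linear cost:", affine_gap_cost(msa, gap_open=0),
          "affine cost:", affine_gap_cost(msa))
```

With a purely linear gap cost the two toy MSAs tie, while any positive gap-open penalty applied per row prefers the compact one. Whether the real alignments tie depends on how gaps already present in the graph are scored.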

All the information to reproduce is here (see README for command lines):
https://public.gi.ucsc.edu/~hickey/debug/abpoa_direction_oct17_2024/

Thanks so much!

@yangao07
Owner

yangao07 commented Oct 17, 2024

The difference comes from two sources:

  1. You used seeding and a progressive tree to order the input sequences, which does not work well for sequences from this kind of repeat region. I did get fewer gaps on the forward strand with seeding disabled.
  2. The more-than-one-gap alignment in the first MSA is actually optimal, even though its RC gets a one-gap alignment, because some gaps are not penalized when they already exist in the partial order alignment graph. So input order is very important in determining the number of gaps in the alignment.

Although I don't know which is better, they are all expected results.

Forward strand without seeding:
[image]
Reverse strand without seeding:
[image]

@yangao07
Owner

But I do agree that they are not truly optimal alignment results.
It may not be easy, but I will try to improve it.

@glennhickey
Contributor Author

Thanks for the quick follow-up. By eye it still seems that the reverse with seeding is the best. I understand that the difference between the different scenarios is explainable by the order, and it's not reflected in the current scoring scheme.

I'm still not sure I understand the difference when aligning the different strands -- shouldn't the order be unaffected?

In any case, it does seem like there is room for future improvements -- we are happy to test any ideas you come up with!

@yangao07
Owner

The difference between the strands arises because abpoa always puts gaps in the left-most position.
To get the same result, gaps would have to be put on the right side for the reverse-complement strand.
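The effect of left-most gap placement can be seen on a toy homopolymer. The sketch below is a simplified illustration, not abpoa's algorithm: `leftmost_deletion` is a hypothetical helper that places a single one-base deletion at the first position where the gapped query still matches the reference, and the sequences are invented for the example.

```python
# Complement table; the gap character maps to itself.
COMP = str.maketrans("ACGT-", "TGCA-")

def revcomp(aligned: str) -> str:
    """Reverse complement a (possibly gapped) sequence."""
    return aligned.translate(COMP)[::-1]

def leftmost_deletion(ref: str, query: str) -> str:
    """Insert one gap into query at the left-most position that keeps
    every non-gap column identical to ref (query is one base shorter)."""
    if len(query) != len(ref) - 1:
        raise ValueError("query must be exactly one base shorter than ref")
    for i in range(len(ref)):
        candidate = query[:i] + "-" + query[i:]
        if all(q == "-" or q == r for q, r in zip(candidate, ref)):
            return candidate
    raise ValueError("no single-gap placement matches ref")

ref, query = "CAAAAT", "CAAAT"

# Align in the forward orientation: the gap lands at the left of the A run.
forward = leftmost_deletion(ref, query)

# Align the reverse complements, then flip the result back for comparison:
# the gap now sits at the right of the A run in forward coordinates.
reverse_view = revcomp(leftmost_deletion(revcomp(ref), revcomp(query)))

print(forward)       # C-AAAT
print(reverse_view)  # CAAA-T
```

Both placements have the same score, so a left-most tie-break alone is enough to make the forward and reverse-complement alignments differ inside repeats.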

@yangao07
Owner

Hi @glennhickey again,

I am adding a parameter to abpoa to handle this type of homopolymer sequence alignment and produce visually better alignment results. I also ran into this type of issue in my own project.
Do you have any ready-to-use data that could serve as an evaluation dataset? For example, a set of input sequences with the expected MSA result or consensus sequence.
I can use some simulated data, but maybe you have some real-world examples.

Cheers!

@glennhickey
Contributor Author

Apart from what I've shared in GitHub issues here, we have a couple of small simulated tests in Cactus:

https://github.com/UCSantaCruzComputationalGenomicsLab/cactusTestData

where we use mafComparator to compare to the provided truth MAF.

If you end up making a significant change, I should be able to plug it into Cactus and, say, make a new pangenome graph and measure some stats on that...

@yangao07
Owner

Are there any specific score parameters or a score matrix I should use for this data?

@glennhickey
Contributor Author

Hmm, that data's probably best with the (current) default cactus scores, e.g.
https://public.gi.ucsc.edu/~hickey/debug/abpoa_fail_mar21.mat
https://public.gi.ucsc.edu/~hickey/debug/abpoa_fail_mar21.cmd

@yangao07
Owner

Hi @glennhickey, I did come up with some heuristics for improved graph alignment. It would be great if you could give me some comments on them.
Do you happen to have some time to talk via Zoom?
