Picking lambda to give equal weight to the two vectors #64
Thanks for the response! Let me outline my current use case for you and see if that gives you any ideas. I have a collection of trees built from whole-genome SNP data for some number (say 40) of bacterial isolates that are members of the same outbreak cluster, plus some closely related non-cluster strains. The trees contain identical sets of genomes but differ in the levels of sequencing coverage used to call SNPs. The coverage differences were simulated by downsampling high-coverage sequencing data to levels ranging from 1x to 50x, with the downsampling repeated 10 times per coverage level. The goal of the project is to determine at what level of coverage the SNP calls change enough to start "significantly" affecting the tree. We're also looking at a variety of other sequence-level metrics, but I thought using this metric and PCO to identify when/if the coverage change started to create "outlier" trees would be useful. It may also be worth noting that these are unrooted ML trees generated in RAxML; I midpoint-rooted each of them before adding them to the multiPhylo object in R.
When I was using your package to examine the treespace generated by these trees (via findGroves), the parameters I manipulated most were the clustering algorithm, the number of clusters, and lambda. Lambda obviously had a strong effect on the scatter of the points in the PCO, and I was struggling to figure out the best way to decide on a setting. As I understand lambda here, it controls the relative weight given to branch length and topology. A setting that gives equal weight to both characteristics seemed reasonable, but I could find no straightforward way to determine such a setting. Branch lengths were quite short since most isolates were so closely related, so I was exploring the 0.9-1.0 range, given your comment in the paper that short branch lengths require high lambda values if branch length is to have much weight in the metric. Still, it felt very much like I was randomly manipulating parameters until I got a picture that matched my hypothesis.
I've attached two quick plots that I generated yesterday. The points are colored by the cluster they belong to based on findGroves, and the alpha level indicates their coverage. One plot has a 30x coverage cutoff (our current required level of sequencing coverage for calling SNPs) and the other has a 12x coverage cutoff, which was the minimum acceptable level suggested by the other sequence-level quality metrics we examined. Thank you for your help!
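For concreteness, a minimal R sketch of the workflow described above (the file name `trees.nwk` and the values for `nf`, `nclust`, `clustering` and `lambda` are placeholders, and this assumes `findGroves()` forwards `lambda` to the underlying tree-vector computation via its `...` argument):

```r
library(ape)        # read.tree
library(phangorn)   # midpoint
library(treespace)  # findGroves, plotGroves

# unrooted ML trees from RAxML, all containing the same set of genomes
trees <- read.tree("trees.nwk")

# midpoint-root each tree before building the multiPhylo object
trees <- lapply(trees, midpoint)
class(trees) <- "multiPhylo"

# explore groupings while varying lambda, the clustering method and nclust
grv <- findGroves(trees, nf = 3, nclust = 4,
                  clustering = "ward.D2", lambda = 0.95)
plotGroves(grv)
```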
Thanks for explaining this, it sounds like a fascinating project and I hope the tree metric will prove useful. It's great to see from your plots that the higher coverage trees are more similar to each other, which agrees with intuition! A few thoughts:
I'm aware that none of that addresses the main question, sorry. I will give it some more thought and get back to you!
Twitter question from Jo Williams-Newkirk: can we have a function to pick lambda, e.g. for equal weight on branch length & topology?
Thanks for the question, Jo. If I understand correctly, you're saying that when we have our tree vector:
v = (1 - lambda) m + lambda M
it would be nice to pick lambda such that "(1 - lambda) m" and "lambda M" give roughly equal contributions to v. Is that right?
If so, a rough way to do that is to find the mean (or median) of the entries in m, call that number m*, and the mean (or median) of the entries in M, call that M*. Then solve
(1 - lambda) m* = lambda M*
which gives
lambda = m* / (M* + m*)
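For a single tree, that calculation might look like the sketch below, assuming `treeVec()` with `lambda = 0` returns the topological vector m and with `lambda = 1` returns the branch-length vector M (so that v = (1 - lambda) m + lambda M); the random tree is just a stand-in:

```r
library(ape)        # rtree
library(treespace)  # treeVec

tr <- rtree(40)               # random example tree; replace with a real tree

m <- treeVec(tr, lambda = 0)  # topology-only vector, m
M <- treeVec(tr, lambda = 1)  # branch-length-only vector, M

m.star <- median(m)           # or mean(m)
M.star <- median(M)           # or mean(M)

lambda.eq <- m.star / (M.star + m.star)
lambda.eq
```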
However, I'm not sure about this because I think the mean / median could be obscuring some useful subtleties. The other problem is that when calculating the vector for one tree, you almost certainly want to calculate it for other trees so that you can start doing comparisons, and so you need to pick the same lambda for all the trees. A mean / median of all the vector entries isn't necessarily what you want... for one thing, it's going to bias towards more unbalanced trees, as they get the largest entries in m (and therefore typically in M too).
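One rough way to get a single lambda for a whole collection (subject to the bias just mentioned) is to pool the vector entries across all trees before taking the median; a sketch, assuming `trees` is a multiPhylo object as above:

```r
# pool topology and branch-length entries across all trees,
# then apply the same equal-contribution formula
m.all <- unlist(lapply(trees, treeVec, lambda = 0))
M.all <- unlist(lapply(trees, treeVec, lambda = 1))
lambda.pooled <- median(m.all) / (median(M.all) + median(m.all))
```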
Another approach is to rescale the branch lengths in your trees so they have median length 1 before starting the comparison. That will make the behaviour under changing lambda a bit more intuitive, and will mean that lambda=0.5 gives approximately equal weight to topology and branch lengths, but will obscure other things (e.g. if tree A has identical topology to tree B but each branch is twice as long, then tree A and tree B will end up being identical when perhaps they shouldn't be).
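A sketch of that rescaling, again assuming `trees` is a multiPhylo object (the median could equally be replaced by the mean):

```r
# divide each tree's branch lengths by its median branch length,
# so lambda = 0.5 weights topology and branch lengths roughly equally
rescale.median <- function(tr) {
  tr$edge.length <- tr$edge.length / median(tr$edge.length)
  tr
}

trees.rescaled <- lapply(trees, rescale.median)
class(trees.rescaled) <- "multiPhylo"

res <- treespace(trees.rescaled, nf = 3, lambda = 0.5)
```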
So I'm not sure there's a one-size-fits-all way of approaching this, but maybe you have an idea? When comparing a bunch of trees to a reference tree there may be a more appropriate way of doing it based on the mean / median of the reference tree vector entries.
For the practical applications I've considered so far, my approach has been as follows: