Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Picking lambda to give equal weight to the two vectors #64

Open
MichelleKendall opened this issue Jan 5, 2016 · 2 comments
Open

Picking lambda to give equal weight to the two vectors #64

MichelleKendall opened this issue Jan 5, 2016 · 2 comments

Comments

@MichelleKendall
Copy link
Collaborator

Twitter question from Jo Williams-Newkirk: can we have a function to pick lambda eg. for equal weight on branch length & topology?

Thanks for the question, Jo. If I understand correctly, you're saying that when we have our tree vector:
v = (1- lambda) m + lambda M
it would be nice to pick lambda such that "(1- lambda) m" and "lambda M" give roughly equal contributions to v. Is that right?

If so, a rough way to do that is find the mean (or median) of the entries in m, call that number m*, and the mean (or median) of the entries in M, call that M*. Then solve
(1- lambda) m* = lambda M*
which gives
lambda = m* / (M* + m*)

However, I'm not sure about this because I think the mean / median could be obscuring some useful subtleties. The other problem is that when calculating the vector for one tree, you almost certainly want to calculate it for other trees so that you can start doing comparisons, and so you need to pick the same lambda for all the trees. A mean / median of all the vector entries isn't necessarily what you want... for one thing, it's going to bias towards more unbalanced trees, as they get the largest entries in m (and therefore typically in M too).

Another approach is to rescale the branch lengths in your trees so they have median length 1 before starting the comparison. That will make the behaviour under changing lambda a bit more intuitive, and will mean that lambda=0.5 gives approximately equal weight to topology and branch lengths, but will obscure other things (e.g. if tree A has identical topology to tree B but each branch is twice as long, then tree A and tree B will end up being identical when perhaps they shouldn't be).

So I'm not sure there's a one-size-fits-all way of approaching this, but maybe you have an idea? When comparing a bunch of trees to a reference tree there may be a more appropriate way of doing it based on the mean / median of the reference tree vector entries.

For the practical applications I've considered so far, my approach has been as follows:

  • see what happens at lambda=0: are there any outlying trees, or separated clusters?
  • see what happens at lambda=1: if they are trees from the same process (e.g. from BEAST) they are usually tightly clustered into one group (because where the trees differ, it's only on comparatively small branches). If not, look at where in the trees the differences are arising.
  • try in-between values of lambda (easiest using the Shiny app with a small sample) to see when / how things change. You quickly get a feel for how this depends on branch lengths and also where in the trees the differences are occurring.
@ajwnewkirk
Copy link

Thanks for the response! Let me outline my current use case for you and see if that gives you any ideas.

I have a collection of trees that contain whole genome SNP data for some number (say 40) of bacterial isolates that are members of the same outbreak cluster plus some closely related non-cluster strains. The trees are identical in the genomes that were included, but differ in the levels of sequencing coverage used to call SNPs. The differences in sequencing coverage were simulated by down sampling high coverage sequencing data to different levels of coverage ranging from 1x-50x. The down sampling was repeated 10 times for each level of coverage. The goal of the project is to determine at what level of coverage the SNP calls change enough to start "significantly" affecting the tree. We're also looking at a variety of other sequence-level metrics, but I thought using this metric and PCO to identify when / if coverage change started to create "outlier" trees would be a useful.

It may also be worth noting that these are unrooted ML trees generated in RAxML. I performed a midpoint root on each of them before adding them to the multiphylo object in R.

When I was using your package to examine the treespace generated by these trees (via findGroves), the parameters I manipulated most were the cluster algorithm, number of clusters, and lambda. Lambda obviously had a strong effect on the scatter of the points in the PCO, and I was struggling to figure out the best way to decide on a setting. As I understand lambda here, it controls the relative weight given to branch length and topology. A setting that gives equivalent weight to both characteristics seemed reasonable, but I could find no straightforward way to determine such a setting. Branch lengths were quite short since most isolates were so closely related, so I was exploring the 0.9-1.0 range given your comment in the paper that short branch lengths required high lambda values if branch length is to have much weight in the metric. But still, it felt very much like I was randomly manipulating parameters until I got a picture that matched my hypothesis.

I've attached two quick plots that I generated yesterday. The points are covered by the cluster they belong to based on findGroves. The alpha level indicates their coverage level, and one plot has a 30x coverage cutoff (our current required level of sequencing coverage for calling SNPs) and the other has a 12x coverage cutoff, which was the minimum acceptable level suggested by the other sequence-level quality metrics examined.

Thank you for your help!

KC_metric_PCO_30x.pdf
KC_metric_PCO_12x.pdf

@MichelleKendall
Copy link
Collaborator Author

Thanks for explaining this, it sounds like a fascinating project and I hope the tree metric will prove useful. It's great to see from your plots that the higher coverage trees are more similar to each other, which agrees with intuition!

A few thoughts:

  • "playing with" the number of clusters / method of clustering / values of lambda is easiest in the shiny app - have you tried that? Either launch it by typing "treescapeServer()" in R, or you can use it directly here: shiny.imperial-stats-experimental.co.uk/users/mlkendal/treescape
    Of course, as you point out, this absolutely is just manipulating parameters to see if anything interesting pops out. For a principled way to pick the number of clusters I recommend using find.clusters from adegenet and minimising the BIC as described here: http://adegenet.r-forge.r-project.org/files/tutorial-dapc.pdf
    Integrating that properly into treescape is on my "to do" list..!
  • rooting: midpoint rooting may be working just fine but it would be worth checking lambda=0 to see if there are any well-separated clusters, and if so, whether those sets of trees only really differ by root position. If so, you'll probably want to eliminate that before continuing. If the non-cluster strains form a monophyletic clade then perhaps you could root them all on that? Failing that, you can always root all the trees on the same tip for this part of the analysis.
  • have you tried viewing the space in 3d? Again, this is straightforward in the shiny app, or there's an example at the end of the ?treescape help file using rgl. It may be that the clusters are more distinct / informative if you keep another dimension in the plot.
  • perhaps it would be interesting to temporarily forget clustering and instead colour the plot according to coverage level. This might give a handle on what might be a "sufficient" level of coverage. Another way of looking at it is to take what you consider to be your "best" tree (full coverage?) and compare each other tree to that (there's a handy function called "refTreeDist" in the development version of treescape here on GitHub, soon to be rolled out on CRAN). This might give you a good idea of when things (hopefully) start to converge to the "right" tree.

I'm aware that none of that addresses the main question, sorry. I will give it some more thought and get back to you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants