Fancier statistics for counting #9
Closed · roryk opened this issue Mar 16, 2016 · 15 comments
roryk (Collaborator) commented Mar 16, 2016

Did you have something in mind for how to do something more 'correct' than weighting by number of hits? Something like the way Salmon does it at the end, perhaps? Do you know how that works? I am not smart enough to understand it.

vals (Owner) commented Mar 21, 2016

Yes, I spent a couple of days thinking about that last year, going through the Salmon and Kallisto preprints to try to see how their models would relate to the UMI-tag case.

One thing that is immediate is that for tag counting a lot of the coverage parameters become simpler, since we only expect the end tags of the cDNA to be sampled, and that case seems pretty simple to implement. You can probably sort of hack it in by changing your reference transcriptome to just the last ~200 bases of each transcript before you make the index.
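A minimal sketch of that hack (assuming a pysam-style FASTA reader; file names are placeholders):

```python
# Trim a reference transcriptome to the last ~200 bases of each
# transcript, so only the expected tag region gets indexed.
import pysam

TAG_LENGTH = 200  # assumed size of the sequenced tag region

with pysam.FastxFile("transcriptome.fa") as fin, \
        open("transcriptome.tags.fa", "w") as fout:
    for entry in fin:
        tail = entry.sequence[-TAG_LENGTH:]  # 3' end of the transcript
        fout.write(f">{entry.name}\n{tail}\n")
```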

That does not use any information from UMIs though. The computer science problem that one essentially solves when counting UMIs (assuming only unique mappings) is the count-distinct problem. I think one needs to find a way to combine the relative abundance estimation problem with the count-distinct problem, but I can't figure out how to do that in a sensible way.

You have to solve a discrete problem using continuous information, I think.

Another way of looking at it would be to make a Salmon-like algorithm and use the UMIs to estimate some per-transcript auxiliary parameter for PCR duplication. Then at the end, multiply the relative abundance measure (now corrected for duplication) by the total observed/possible number of UMIs.
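As a rough formula (ad-hoc notation, not the Salmon preprint's): with rho_t the duplication-corrected relative abundance of transcript t and U the total number of distinct UMIs observed, the absolute estimate would be count_t = rho_t * U.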

I'm copying in @rob-p because he might have some ideas and I think it's a discussion that would be interesting to have in public where other people can chime in.

I'm also attaching my (kind of legible...) notes on this, in the notation of the Salmon preprint (I think):

20160321110618437.pdf

roryk (Collaborator, Author) commented Mar 21, 2016

Hi @vals,

What do you think about pre-filtering the BAMs to drop all PCR duplicates? Then there wouldn't be a UMI problem anymore, and we could maybe split the BAM by cellular barcode and run salmon on the split BAMs?
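A minimal sketch of the splitting step (assuming pysam, and assuming the cellular barcode is stored as the first colon-separated field of the read name; the file names and that convention are placeholders):

```python
# Split a deduplicated BAM into one BAM per cellular barcode.
import pysam

infile = pysam.AlignmentFile("dedup.bam", "rb")
writers = {}  # barcode -> output BAM

for read in infile:
    barcode = read.query_name.split(":")[0]  # assumed naming convention
    if barcode not in writers:
        writers[barcode] = pysam.AlignmentFile(
            f"cell_{barcode}.bam", "wb", template=infile)
    writers[barcode].write(read)

for writer in writers.values():
    writer.close()
infile.close()
```

(For thousands of cells you would want to cap the number of simultaneously open files.)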

vals (Owner) commented Mar 21, 2016

When you're mapping the read, in essence, you are assigning the UMI connected to that read to the transcript it maps to.

Say you have two reads, and the mapping assigns the UMI to transcripts in this way:

(read_1, umi_1) -> (tx_1, umi_1), (tx_2, umi_1), (tx_3, umi_1)
(read_2, umi_1) -> (tx_2, umi_1), (tx_3, umi_1), (tx_4, umi_1)

We don't know which transcript umi_1 was duplicated from. When you filter this, you would remove one copy of each read on account of the tx_2 or tx_3 evidence, but not on account of the other two transcripts. If you converted this deduplicated BAM back to FASTQ, the two reads would end up being four reads...

This mapping ambiguity is why you need to solve the relative abundance problem. In Ntranos et al. they get around this problem by treating all the transcripts a read can map to (a mapping equivalence class) as a single unit.
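As a toy illustration of the equivalence-class view (made-up data, not this package's code), the two reads above fall into two different classes:

```python
# Group reads by the set of transcripts they map to (their equivalence
# class) and count distinct UMIs per class.
from collections import defaultdict

# (umi, set of candidate transcripts), as in the example above
mappings = [
    ("umi_1", frozenset({"tx_1", "tx_2", "tx_3"})),  # read_1
    ("umi_1", frozenset({"tx_2", "tx_3", "tx_4"})),  # read_2
]

umis_per_class = defaultdict(set)
for umi, eq_class in mappings:
    umis_per_class[eq_class].add(umi)

for eq_class, umis in umis_per_class.items():
    print(sorted(eq_class), "->", len(umis), "distinct UMI(s)")
```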

rob-p commented Mar 24, 2016

Hi guys,

I feel like, perhaps, I still don't understand the formal problem definition well enough. We are given a collection of N fragments containing N' UMI tags, where N' <= N; is that right? Now what, precisely, is our goal? Is it simply that we want to assign UMIs (uniquely or proportionally) to cDNAs? If so, what is the meaning of the duplicated UMIs (i.e. if we observe the same UMI in two or more different reads)? For example, in the example given above, do we want to consider (read_1, umi_1) and (read_2, umi_1) as one occurrence of umi_1 for the purposes of allocation? Sorry for my naïveté on this front, but I just want to understand the formal problem solidly before trying to think about a solution.

vals (Owner) commented Mar 24, 2016

Hey Rob,

I'll try to provide some background.

I think the general reason for sequencing end tags rather than fragments from the entire transcript is to save sequencing cost: the libraries are much less complex, so you need to sequence less per sample to reach saturation. Unlike the RNA-Seq case, the number of fragments arising from a transcript does not scale with the length of the transcript.

The problem is that sequencing end-tag sequences from small amounts of starting material requires a very large number of PCR cycles (or IVT rounds), and some cDNA-tags will be preferentially amplified. Fragment amplification bias is quite a large concern in regular RNA-seq, but even more so here, given how low the complexity of the starting material is in the tag-sequencing case.

To counteract this, each cDNA-tag is labeled with a random k-mer called a UMI (where k ranges from 4 to 32 depending on the implementation). If k = 6, every cDNA-tag in the sample can be labeled in 4^6 = 4,096 ways. During amplification, the (UMI, cDNA-tag) tuples get amplified together. So, say we sequence this amplified library of (UMI, cDNA-tag) tuples: if we observe 1,000 cDNA-tag fragments carrying 10 unique UMIs, each observed 100 times, we would infer that we actually started with 10 cDNA-tags, each amplified 100 times.

In an ideal world, we could solve this by going through the fragments in a FASTQ file and removing every duplicated (UMI, fragment) pair. But errors can be introduced both during amplification and during sequencing, so exact matching is the wrong thing to do. Also, the RT step which produces the tags is not exact, so the cDNA-tags might be shifted a bit from the very end of the transcript.
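For reference, that "ideal world" exact dedup would be a one-pass filter like this sketch (assuming pysam for FASTQ parsing and a UMI-in-read-name convention, both placeholders); as noted, exact matching is too strict in practice:

```python
# Drop exact duplicate (UMI, fragment sequence) pairs from a FASTQ.
import pysam

seen = set()
with pysam.FastxFile("reads.fq") as fin, open("dedup.fq", "w") as fout:
    for rec in fin:
        umi = rec.name.split(":")[-1]  # assumed UMI-in-name convention
        key = (umi, rec.sequence)
        if key in seen:
            continue  # exact duplicate, skip
        seen.add(key)
        fout.write(str(rec) + "\n")
```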

So for that strategy we would have to do some clustering / fuzzy matching of the fragments, then count unique UMIs in each cluster of fragments.

The strategy here is that, since we know the cDNA-tags arise from the transcriptome, the fuzzy matching is handled by the mapping to transcripts. So we transfer a UMI from (umi, fragment) to (umi, transcript). Now the counting is easy, as we know the transcripts a priori. However, because of mapping ambiguity, the UMI gets transferred to multiple transcripts.
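For uniquely mapping reads the counting really is that simple: count distinct (transcript, UMI) pairs. A toy sketch with made-up data:

```python
# Count each transcript's distinct UMIs; exact duplicates collapse.
from collections import defaultdict

pairs = [
    ("tx_1", "umi_1"),
    ("tx_1", "umi_1"),  # PCR duplicate: same transcript, same UMI
    ("tx_1", "umi_2"),
    ("tx_2", "umi_1"),  # same UMI on another transcript counts separately
]

umis_per_tx = defaultdict(set)
for tx, umi in pairs:
    umis_per_tx[tx].add(umi)

print({tx: len(umis) for tx, umis in umis_per_tx.items()})
# {'tx_1': 2, 'tx_2': 1}
```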

The problem is, how do we allocate these counts to the individual transcripts given that ambiguity? (Which later get collapsed to "genes").

In the example above we want to consider (read_1, umi_1) and (read_2, umi_1) as one occurrence if the sequence of read_1 is sufficiently similar to that of read_2. If the situation were

(read_1, umi_1) -> (tx_1, umi_1)
(read_2, umi_1) -> (tx_1, umi_1)

we would count it as one occurrence of tx_1 in the sample. But if

(read_1, umi_1) -> (tx_1, umi_1)
(read_2, umi_2) -> (tx_1, umi_2)

we would count it as two occurrences. Similarly if

(read_1, umi_1) -> (tx_1, umi_1)
(read_2, umi_1) -> (tx_2, umi_1)

we would infer that both tx_1 and tx_2 had been observed once.

In the current counting implementation in this package, if we have

(read_1, umi_1) -> (tx_1, umi_1), (tx_2, umi_1)
(read_2, umi_1) -> (tx_1, umi_1), (tx_2, umi_1)

each of tx_1 and tx_2 gets 0.5 "observation" or "evidence". If

(read_1, umi_1) -> (tx_1, umi_1), (tx_2, umi_1)
(read_2, umi_1) -> (tx_2, umi_1), (tx_3, umi_1)

then each of tx_1, tx_2, and tx_3 gets 0.5 "evidence". This is where things go wrong, I think. And the problem is: how do we allocate the umi_1 observation evidence to tx_1, tx_2, and tx_3?
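A sketch of the weighting just described (an illustration of the scheme, not the package's exact code): each (UMI, transcript) pair gets evidence 1/k from a read mapping to k transcripts, and repeated evidence for the same pair is not added again:

```python
from collections import defaultdict

reads = [
    ("umi_1", ["tx_1", "tx_2"]),  # read_1
    ("umi_1", ["tx_2", "tx_3"]),  # read_2
]

evidence = defaultdict(dict)  # transcript -> {umi: weight}
for umi, txs in reads:
    weight = 1.0 / len(txs)
    for tx in txs:
        # keep a single contribution per (transcript, UMI) pair
        evidence[tx][umi] = max(evidence[tx].get(umi, 0.0), weight)

print({tx: sum(umis.values()) for tx, umis in evidence.items()})
# {'tx_1': 0.5, 'tx_2': 0.5, 'tx_3': 0.5}
```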

rob-p commented Mar 24, 2016

Thanks Valentine,

This explanation is very clear, and I'll think about this more deeply. My first thought, though, is: would it be possible to do what Sailfish / Salmon normally do in terms of equivalence classes, but simply "adjust" the equivalence class counts by removing duplicate UMI tags? Basically, the idea is that the count of reads in an equivalence class would be equal to the number of distinct UMIs in that class, rather than the actual count of fragments. Given this information, you could then run the EM (or VBEM) "normally" to allocate these counts to the transcripts. One caveat is that since the equivalence classes forget the sequences of the reads they contain, you could have a situation where two different reads with the same UMI fall into the same equivalence class and therefore get discounted. Otherwise, though, I think the general approach would do the "right thing".
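A minimal sketch of this idea (toy class counts, uniform effective lengths, plain EM; not Salmon's actual implementation):

```python
# Allocate per-equivalence-class distinct-UMI counts to transcripts
# with a plain EM.
from collections import defaultdict

# equivalence class -> number of distinct UMIs observed in it
class_counts = {
    frozenset({"tx_1"}): 5,
    frozenset({"tx_1", "tx_2"}): 3,
    frozenset({"tx_2", "tx_3"}): 2,
}

transcripts = sorted({t for ec in class_counts for t in ec})
abundance = {t: 1.0 / len(transcripts) for t in transcripts}

for _ in range(100):  # EM iterations
    expected = defaultdict(float)
    # E-step: split each class's count according to current abundances
    for ec, n in class_counts.items():
        total = sum(abundance[t] for t in ec)
        for t in ec:
            expected[t] += n * abundance[t] / total
    # M-step: renormalise to get updated relative abundances
    grand_total = sum(expected.values())
    abundance = {t: expected[t] / grand_total for t in transcripts}

print(abundance)
```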

--Rob

vals (Owner) commented Mar 24, 2016

Yes, I think a mode like that for Sailfish / Salmon would be very appropriate, with the additional, data-motivated assumption that "effectiveLength" should be set to a constant value for every transcript in the model. And to get absolute counts at the end, numReads can be based on the number of deduplicated reads rather than all input reads.
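In other words (a sketch of the arithmetic, in ad-hoc notation): with rho_t the estimated relative abundance of transcript t under a constant effective length, and N_dedup the number of deduplicated reads, the absolute estimate would be count_t = rho_t * N_dedup.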

vals (Owner) commented Mar 24, 2016

@rob-p ... I just realised that such a mode in Sailfish / Salmon could be referred to as a "caviar" mode.. (because... single cells...)

rob-p commented Mar 24, 2016

👍 --- loves it!

flying-sheep (Contributor) commented

here is some prior art: https://github.com/CGATOxford/UMI-tools

details here

vals (Owner) commented May 3, 2016

@flying-sheep I'm aware of those; I wrote these scripts because aligning to a genome is a waste of time and resources, and the script to extract barcodes and UMIs was not general enough to handle all the different kinds of data sets I was working on.

But this issue in particular is about considering something those tools completely ignore: the uncertainty about the source of a tag.

flying-sheep (Contributor) commented May 3, 2016

oh, sorry, i didn’t read the whole thing and didn’t check where “weighting by number of hits” comes into play. thanks for the clarification. maybe you should make it clear in the issue title?

thanks for building this, its functionality is perfect for me! might have to make it faster though.

your choice of tools and libraries seems to match mine pretty well, so if you thought about writing in a faster language, you might also enjoy the Rust programming language.

vals (Owner) commented May 3, 2016

@flying-sheep Sorry for the dismissive tone that I'm catching rereading my response...

There's another issue (#5) which speaks about the UMI-merging problem.

Let me know if you manage to make it faster. I benchmarked several SAM parsing libraries (hts-python, simplesam, and pysam) and found pysam, which wraps the htslib C library, to be the fastest. And generally, profiling indicates it's IO bound.

I'm learning Julia at the moment; Rust seems a bit strict to me.

flying-sheep (Contributor) commented May 3, 2016

Sorry for the dismissive tone that I'm catching rereading my response...

none perceived, don’t worry.

Let me know if you manage to make it faster

if anything, i’ll rewrite it using rust – eventually, it’s not a priority right now. handling every single read with python is slooow.

I'm learning Julia at the moment, Rust seems a bit strict to me.

also a good choice i’m sure! i gathered it’s basically a faster, non-ugly, open source MATLAB, or an R without the lazy evaluation insanity and the feeling of wading through tar. ugh, MATLAB… seriously, if a language requires me to write argument parsers instead of using keyword arguments, someone didn’t have a sense for priorities.

vals (Owner) commented Aug 16, 2016

Closing this as it is basically handled by the kallisto single cell mode.

vals closed this as completed Aug 16, 2016