Fancier statistics for counting #9
Yes, I spent a couple of days thinking about that last year, going through the Salmon and Kallisto preprints to try to see how their models would relate to the UMI-tags case.

One thing that is immediate is that for tag counting a lot of the coverage parameters become simpler, since we only expect the end-tag cDNA to be sampled, and that case seems pretty simple to implement. You can probably sort of hack it in by changing your reference transcriptome to just the last ~200 bases of each transcript before you make the index. That does not use any information from UMIs though.

The computer science problem that one essentially solves when counting UMIs (assuming only unique mappings) is the count-distinct problem. To solve the general problem, without assuming unique mappings, I think one needs to find a way to combine the relative abundance estimation problem with the count-distinct problem. I can't figure out how to do that in a sensible way; you have to solve a discrete problem using continuous information, I think.

Another way of looking at it would be to make a Salmon-like algorithm and use the UMIs to estimate some PCR-duplication auxiliary parameter per transcript. Then at the end, multiply the relative abundance measure (corrected for duplication) by the total observed/possible number of UMIs.

I'm copying in @rob-p because he might have some ideas, and I think it's a discussion that would be interesting to have in public where other people can chime in. I'm also attaching my (kind of legible...) notes on this, in the notation of the Salmon preprint (I think).
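A minimal sketch of that truncation hack, assuming Biopython is available (the file names and the exact 200-base cutoff are just placeholders):

```python
# Sketch: build a 3'-end-only reference by keeping the last ~200 bases of each
# transcript, then index that FASTA with salmon/kallisto as usual.
# Assumes Biopython is installed; file names are hypothetical.
from Bio import SeqIO

TAG_LENGTH = 200  # roughly the region end-tag cDNA fragments can come from

with open("transcripts.endtags.fa", "w") as out:
    for record in SeqIO.parse("transcripts.fa", "fasta"):
        # keep the last TAG_LENGTH bases (or the whole sequence if it is shorter)
        record.seq = record.seq[-TAG_LENGTH:]
        SeqIO.write(record, out, "fasta")
```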
Hi @vals, What do you think about pre-filtering the BAMs to drop all PCR duplicates? Then there wouldn't be a UMI problem anymore, and we could maybe split the BAM by cellular barcode and run salmon on the split BAMs?
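For concreteness, a rough sketch of that kind of split with pysam, assuming the cellular barcode is stored in a BAM tag (the "CB" tag name and file names here are assumptions; tag names vary by pipeline):

```python
# Sketch: split a BAM by cellular barcode so each cell can be quantified
# separately. Assumes pysam is installed and the barcode lives in the "CB" tag.
import pysam

infile = pysam.AlignmentFile("aligned.bam", "rb")
outfiles = {}

for read in infile:
    if not read.has_tag("CB"):
        continue
    barcode = read.get_tag("CB")
    if barcode not in outfiles:
        # note: with many thousands of barcodes you would hit open-file limits
        # and want to batch this; kept simple for illustration
        outfiles[barcode] = pysam.AlignmentFile(
            f"cell_{barcode}.bam", "wb", template=infile
        )
    outfiles[barcode].write(read)

for handle in outfiles.values():
    handle.close()
infile.close()
```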
When you're mapping the read, in essence, you are assigning the UMI connected to that read to the transcript it maps to. Say you have two reads, and the mapping assigns the UMI to transcripts in this way:

    (umi1, read1) -> transcriptA
    (umi1, read2) -> transcriptA, transcriptB

We don't know from where read2 really originated, and this mapping ambiguity is why you need to solve the relative abundance problem. In Ntranos et al they get around this problem by considering all the transcripts a read can map to (a mapping equivalence class) as a single thing.
Hi guys, I feel like, perhaps, I still don't understand the formal problem definition well enough. We are given a collection of reads, each tagged with a UMI and mapped to one or more transcripts; what, exactly, is the quantity we are trying to estimate from them?
Hey Rob, I'll try to provide some background.

I think the general reason for sequencing end tags rather than fragments from the entire transcript is to save sequencing cost. You need to sequence less per sample to reach saturation, since the libraries are much less complex. Unlike the RNA-seq case, the number of fragments arising from a transcript does not scale with the length of the transcript.

The problem comes from sequencing end-tag sequences with small amounts of starting material, when a very large number of PCR cycles (or IVT) is needed, and some cDNA-tags will be preferentially amplified. Fragment amplification bias is quite a large concern in RNA-seq, but even more so when there is such low complexity in the starting material, as in the tag-sequencing case.

To try to counteract this, each cDNA-tag is labeled with a random k-mer called a UMI (where k ranges from 4 to 32 depending on the implementation). If k = 6, every cDNA-tag in the sample has the possibility of being labeled in 4,096 ways. When amplifying, the (UMI, cDNA-tag) tuples get amplified together.

So, say that we sequence this amplified library of (UMI, cDNA-tag) tuples. If we observe 1,000 cDNA-tag fragments, and they have 10 unique UMIs, each observed 100 times, we would infer that we actually started with 10 cDNA-tags, which were each amplified 100 times.

In an ideal world, we could solve this by going through the fragments in a FASTQ file and removing every duplicated (UMI, fragment) pair. But errors can be introduced in the fragments both during amplification and during sequencing, so exact matching is the wrong thing to do. Also, the RT step which produces the tags is not exact, and the cDNA-tags might be a bit shifted from the very end of the transcript. So for that strategy we would have to do some clustering / fuzzy matching of fragments, then count unique UMIs in each cluster of fragments.

The strategy here is that, since we know the cDNA-tags arise from the transcriptome, fuzzy matching is treated as the mapping to transcripts. So we transfer a UMI from (UMI, fragment) to (UMI, transcript). Now the counting is easy, as we know the transcripts a priori. However, because of mapping ambiguity, the UMI gets transferred to multiple transcripts. The problem is, how do we allocate these counts to the individual transcripts given that ambiguity? (These later get collapsed to "genes".)

In the example above, we want to consider distinct (UMI, transcript) pairs: if we observe (umi1 -> transcriptA) twice, we would count it as one occurrence of (umi1, transcriptA), while if we observe (umi1 -> transcriptA) and (umi2 -> transcriptA), we would count it as two occurrences. Similarly, if we observe (umi1 -> transcriptA) and (umi1 -> transcriptB), we would infer that both transcriptA and transcriptB are present.

In the current counting implementation in this package, if a UMI maps ambiguously, say (umi1 -> transcriptA, transcriptB, transcriptC), then each of transcriptA, transcriptB and transcriptC gets 1/3 of a count, i.e. each hit is weighted by one over the number of hits.
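To make that weighting concrete, here is a toy sketch of the "weight by number of hits" scheme; the data structures and names are made up for illustration and this is not the package's actual code:

```python
# Toy sketch of "weighting by number of hits": each distinct (UMI, mapping)
# observation contributes 1/n to each of the n transcripts it is compatible with.
from collections import defaultdict

# a set of (UMI, frozenset of compatible transcripts); using a set means exact
# duplicates of the same observation collapse to a single occurrence
observations = {
    ("umi1", frozenset({"transcriptA"})),
    ("umi2", frozenset({"transcriptA"})),
    ("umi1", frozenset({"transcriptA", "transcriptB", "transcriptC"})),
}

counts = defaultdict(float)
for umi, transcripts in observations:
    for t in transcripts:
        counts[t] += 1.0 / len(transcripts)

print(dict(counts))  # transcriptA: 1 + 1 + 1/3, transcriptB and transcriptC: 1/3 each
```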
Thanks Valentine,

This explanation is very clear, and I'll think about this more deeply. My first thought, though, is: would it be possible to do what Sailfish / Salmon normally do in terms of equivalence classes, but simply "adjust" the equivalence class counts by removing duplicate UMI tags? Basically, the idea is that the count of reads in an equivalence class would be equal to the number of distinct UMIs in that class, rather than the actual count of fragments. Given this information, you could then run the EM (or VBEM) "normally" to allocate these counts to the transcripts.

One caveat here is that, since the equivalence classes forget the sequence of the reads they contain, I suppose you could have a situation where two different reads having the same UMI fall into the same equivalence class, and therefore get discounted. Otherwise, though, I think the general approach would do the "right thing".

--Rob
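A sketch of what that adjustment could look like upstream of the EM, with made-up data structures (this is only the idea, not Salmon / Sailfish code):

```python
# Sketch: the "count" of each equivalence class (the set of transcripts a read is
# compatible with) becomes the number of distinct UMIs seen in that class,
# rather than the raw number of fragments. Purely illustrative.
from collections import defaultdict

# one entry per mapped fragment: (UMI, frozenset of compatible transcripts)
fragments = [
    ("umi1", frozenset({"transcriptA"})),
    ("umi1", frozenset({"transcriptA"})),                  # duplicate: same UMI, same class
    ("umi2", frozenset({"transcriptA", "transcriptB"})),
    ("umi3", frozenset({"transcriptA", "transcriptB"})),
]

umis_per_class = defaultdict(set)
for umi, eq_class in fragments:
    umis_per_class[eq_class].add(umi)

# these per-class counts would then feed the EM/VBEM in place of fragment counts
class_counts = {eq_class: len(umis) for eq_class, umis in umis_per_class.items()}
print(class_counts)
```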
Yes, I think a mode like that for Sailfish / Salmon would be very appropriate, with the additional assumption, based on the data, that "effectiveLength" should be set to a constant value for each transcript in the model. And to get absolute counts at the end, numReads can be based on the number of deduplicated reads rather than all input reads.
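In other words, the last step would be simple arithmetic along these lines (toy numbers, purely illustrative):

```python
# Toy arithmetic for the last step: relative abundances from the model, scaled by
# the number of deduplicated reads (distinct UMIs) instead of all input reads.
relative_abundance = {"transcriptA": 0.7, "transcriptB": 0.3}  # sums to 1
num_dedup_reads = 1000  # distinct (UMI, cDNA-tag) observations, not raw fragments

absolute_counts = {t: p * num_dedup_reads for t, p in relative_abundance.items()}
print(absolute_counts)  # {'transcriptA': 700.0, 'transcriptB': 300.0}
```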
@rob-p ... I just realised that such a mode in Sailfish / Salmon could be referred to as a "caviar" mode... (because... single cells...)
👍 --- loves it!
here is some prior art: https://github.com/CGATOxford/UMI-tools details here |
@flying-sheep I'm aware of those; I wrote these scripts because aligning to a genome is a waste of time and resources, and the script to extract barcodes and UMIs was not general enough to handle all the different kinds of data sets I was working on. But this issue in particular is about considering something which those tools completely ignore: uncertainty about the source of a tag.
oh, sorry, i didn’t read the whole thing and didn’t check where “weighting by number of hits” comes into play. thanks for the clarification. maybe you should make it clear in the issue title?

thanks for building this, its functionality is perfect for me! might have to make it faster though.

your choice of tools and libraries seems to match mine pretty well, so if you thought about writing in a faster language, you might also enjoy the Rust programming language.
@flying-sheep Sorry for the dismissive tone that I'm catching rereading my response... There's another issue (#5) which talks about the UMI-merging problem.

Let me know if you manage to make it faster. I benchmarked several SAM parsing libraries to find the fastest one (hts-python, simplesam, and pysam), and found pysam, which wraps the C htslib, to be fastest. And generally, profiling indicates it's I/O bound.

I'm learning Julia at the moment; Rust seems a bit strict to me.
none perceived, don’t worry.
if anything, i’ll rewrite it using rust – eventually, it’s not a priority right now. handling every single read with python is slooow.
also a good choice i’m sure! i gathered it’s basically a faster, non-ugly, open source MATLAB, or an R without the lazy evaluation insanity and the feeling of wading through tar. ugh, MATLAB… seriously, if a language requires me to write argument parsers instead of using keyword arguments, someone didn’t have a sense for priorities.
Closing this as it is basically handled by the kallisto single cell mode.
Did you have something in mind about how to do something more 'correct' than weighting by the number of hits? Something like the way Salmon allocates counts at the end? Do you know how that works? I am not smart enough to understand it.
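For reference, the allocation step in Salmon-style tools is an expectation-maximization over equivalence classes. A stripped-down sketch of the idea (illustrative only; the real tools also model effective lengths and biases, and use smarter numerics):

```python
# Minimal EM sketch for allocating equivalence-class counts to transcripts.
# Illustrative only; not Salmon's or kallisto's actual implementation.

# equivalence classes: (count, tuple of compatible transcripts)
eq_classes = [
    (10, ("transcriptA",)),
    (5, ("transcriptA", "transcriptB")),
    (3, ("transcriptB", "transcriptC")),
]
transcripts = {t for _, ts in eq_classes for t in ts}

# start from a uniform abundance estimate
abundance = {t: 1.0 / len(transcripts) for t in transcripts}

for _ in range(100):  # fixed number of EM rounds for simplicity
    expected = {t: 0.0 for t in transcripts}
    # E-step: split each class's count across its transcripts in proportion
    # to the current abundance estimates
    for count, ts in eq_classes:
        total = sum(abundance[t] for t in ts)
        for t in ts:
            expected[t] += count * abundance[t] / total
    # M-step: renormalise the expected counts into new relative abundances
    grand_total = sum(expected.values())
    abundance = {t: v / grand_total for t, v in expected.items()}

print(abundance)
```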