Feat/ims lfq #166
base: master
@@ -19,7 +19,34 @@ impl FromParallelIterator<SageResults> for SageResults {
.reduce(SageResults::default, |mut acc, x| {
acc.features.extend(x.features);
acc.quant.extend(x.quant);
acc.ms1.extend(x.ms1);
match (acc.ms1, x.ms1) {
I think we can make this a method on the enum instead of having it here and repeating it in the serial implementation.
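A rough sketch of what that method could look like. The payload type and the `Empty` default variant are placeholders I made up for illustration (the PR only names `NoMobility` and `WithMobility`), so treat this as the shape of the idea rather than the actual sage types:

```rust
/// Hypothetical stand-in for the PR's enum. The `Vec<u32>` payloads are
/// placeholders for the real spectra, and `Empty` is an assumed default
/// variant so that a `Default`-based reduce has an identity element.
#[derive(Debug, Default, PartialEq)]
pub enum MS1Spectra {
    NoMobility(Vec<u32>),
    WithMobility(Vec<u32>),
    #[default]
    Empty,
}

impl MS1Spectra {
    /// Merge two partial results. Living on the enum, this can be called
    /// from both the parallel and the serial collection code.
    pub fn merge(self, other: MS1Spectra) -> MS1Spectra {
        use MS1Spectra::*;
        match (self, other) {
            // An empty side contributes nothing.
            (Empty, x) => x,
            (x, Empty) => x,
            (NoMobility(mut a), NoMobility(b)) => {
                a.extend(b);
                NoMobility(a)
            }
            (WithMobility(mut a), WithMobility(b)) => {
                a.extend(b);
                WithMobility(a)
            }
            // Mixing variants is treated as a bug in this sketch.
            (NoMobility(_), WithMobility(_)) | (WithMobility(_), NoMobility(_)) => {
                panic!("cannot merge MS1 spectra with and without mobility")
            }
        }
    }
}
```

With something like this, the reduce closure would shrink to roughly `acc.ms1 = acc.ms1.merge(x.ms1);` in both places.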
for &idx in &order {
if intensity_array[idx] <= 0.0 {
// In theory ... if I set the intensity as mutable
TODO: remove outdated comment ...
Oddly enough, this also seems to make the search faster ... I am not 100% sure why (pre-splitting the ms2 spectra is more cache friendly? I don't think so, since we didn't have ms1s before ...)
This is the script I used to benchmark (not a super strict benchmark)
and then logs are filtered with […]
Data from @bsphinney, which I am not sure I can make public, but I can DM you the link to it.
This PR actually contains 2 features that are not really related to one another (which ended up combined bc of skill issues using git worktrees … LMK if you really want me to split them into separate PRs)
Initial support for LFQ on data with ion mobility.
This supersedes: #156
What are the key elements of this solution?
- […] centroiding on the data, which is very rudimentary right now but seems to work well.
- […] `ProcessedSpectrum` struct into two variants, one where the elements are `ProcessedSpectrum<Peak>`s (mz-int) and one where they are `ProcessedSpectrum<IMPeak>`s (mz-int-mobility).
- […] `MS1Spectra` enum that can be either `NoMobility` or `WithMobility`, which is passed with the search results to the LFQ code (so it filters on IM on top of the mz).
- […] `FeatureMap` […] more sense for the scale of the mobility.
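To make the two-variant idea concrete, here is a minimal sketch of the `Peak` / `IMPeak` split and of an LFQ-style lookup that filters on mobility on top of m/z. The field names, the `summed_intensity` method and the window tuples are assumptions for illustration and are not taken from the actual diff:

```rust
/// Hypothetical peak without mobility (mz-int).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Peak {
    pub mass: f32,
    pub intensity: f32,
}

/// Hypothetical peak with mobility (mz-int-mobility).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct IMPeak {
    pub mass: f32,
    pub intensity: f32,
    pub mobility: f32,
}

/// Simplified spectrum, generic over the peak type.
pub struct ProcessedSpectrum<T> {
    pub peaks: Vec<T>,
}

pub enum MS1Spectra {
    NoMobility(Vec<ProcessedSpectrum<Peak>>),
    WithMobility(Vec<ProcessedSpectrum<IMPeak>>),
}

/// Inclusive range check used for both m/z and mobility windows.
fn in_window(x: f32, (lo, hi): (f32, f32)) -> bool {
    x >= lo && x <= hi
}

impl MS1Spectra {
    /// Sum the intensity inside an m/z window. When mobility is available,
    /// the mobility window is applied as an additional filter.
    pub fn summed_intensity(&self, mz: (f32, f32), mobility: (f32, f32)) -> f32 {
        match self {
            MS1Spectra::NoMobility(spectra) => spectra
                .iter()
                .flat_map(|s| s.peaks.iter())
                .filter(|p| in_window(p.mass, mz))
                .map(|p| p.intensity)
                .sum(),
            MS1Spectra::WithMobility(spectra) => spectra
                .iter()
                .flat_map(|s| s.peaks.iter())
                .filter(|p| in_window(p.mass, mz) && in_window(p.mobility, mobility))
                .map(|p| p.intensity)
                .sum(),
        }
    }
}
```

The point of the enum is that the caller does not need to know which kind of data it has: the mobility window is simply ignored for data without ion mobility.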
The bulk of the changes for this are ->
Why did you design your solution this way? Did you assess any alternatives? Are there tradeoffs?
- […] sections of the mobility range (which makes this PR supersede the previous attempt).
- […] `PrecursorRange` and one optional vec to the size of the `RawSpectrum` (which I feel should be fine) ... In theory the `RawSpectrum` could be written to a generic and that gets propagated throughout the program, which would make the change 'zero-cost', but it felt really hard to do ... (it is a lot of re-writing + added complexity).
- […] could add it as parameters to the config and propagate it throughout the program. I am not sure where this would go in the input file (either close to `min_peak` or within the bruker processor)
Speedup on the generation of databases when a large number of peptides are redundant.
This solution drops the library generation time from 45-50 seconds to 12-16 seconds on a particularly redundant database on my laptop.
What are the key elements of this solution?
There are two/three main changes:
- […] `Arc<String>` to `Arc<str>` (which were used to track the protein accessions). This speeds up the comparison of the sequences since the Arc pointer goes directly to the sequence, instead of going to the String that points to the sequence (... I think), and (I think) saves a couple of bytes in the heap (not in number of allocations but in final size).
- […] `DigestGroup` struct). This dramatically speeds up building libraries where many proteins share peptides (isoform fasta files for instance).
- […] These were being used for sorting but not for deduplication.
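For reference, a minimal sketch of the two ideas above: accessions held as `Arc<str>`, and peptides collapsed by sequence before building the library. The `Digest` struct and `dedup_digests` function here are simplified, hypothetical stand-ins and not the actual sage types or the `DigestGroup` implementation:

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical, simplified digest: one peptide sequence plus the
/// accession of the protein it came from.
pub struct Digest {
    pub sequence: String,
    pub protein: Arc<str>,
}

/// Collapse digests that share a sequence, collecting the accessions of
/// every protein that produced it. Output is sorted by sequence so the
/// result is deterministic.
pub fn dedup_digests(digests: Vec<Digest>) -> Vec<(String, Vec<Arc<str>>)> {
    let mut groups: HashMap<String, Vec<Arc<str>>> = HashMap::new();
    for d in digests {
        let proteins = groups.entry(d.sequence).or_default();
        // Comparing `Arc<str>` values compares the string contents.
        if !proteins.contains(&d.protein) {
            // Moving the Arc only moves a pointer; the accession text
            // itself is never copied.
            proteins.push(d.protein);
        }
    }
    let mut out: Vec<_> = groups.into_iter().collect();
    out.sort_by(|a, b| a.0.cmp(&b.0));
    out
}
```

An existing `String` can be converted with `Arc::<str>::from(s)`. On the pointer side, `Arc<str>` is a fat pointer (address plus length) that refers straight to the bytes, while `Arc<String>` is a thin pointer to a `String` that in turn points to the bytes.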
Why did you design your solution this way? Did you assess any alternatives? Are there tradeoffs?
- […] proteins, where nothing needs to be deduplicated, but I don't see it being a big deal in practice.
Extras
Does this PR require a change to the docs?
Did you add or update tests for this change?
Please complete the following checklist:
- […] `CHANGELOG.md`, under the `[Unreleased]` section heading. That entry references the issue closed by this PR.

NOTE: template for the PR is here: https://raw.githubusercontent.com/tconbeer/harlequin/refs/heads/main/.github/PULL_REQUEST_TEMPLATE.md
Extras:
- […] `SpectrumReaderConfig` is implemented RN ... I think we should re-write it to a config within sage (which implements Into:: -> to the timsrust-native ones) ... also this is not documented anywhere ... (which I guess is fine bc it's pretty experimental ...)
- […] `Metadata` struct from a reader? RN I am doing this in a very hacky way (looking for the `analysis.tdf`, but I am not sure if this would work with the mini-tdf or the other variants supported by timsrust)