
Feat/ims lfq #166

Open: wants to merge 10 commits into master
Conversation

@jspaezp (Contributor) commented Jan 19, 2025

This PR actually contains two features that are not really related to one another (they ended up combined because of skill issues using git worktrees … let me know if you really want me to split them into separate PRs).

Initial support for LFQ on data with ion mobility.

This supersedes: #156

What are the key elements of this solution?

  • Adds parsing/reading of the IM-containing spectra from tdf files.
    • Storing all peaks was very slow, so I also added a way to do some
      centroiding on the data, which is very rudimentary right now but
      seems to work well.
  • Splits the ProcessedSpectrum struct into two variants, one where the elements
    are ProcessedSpectrum<Peak>s (mz-int) and one where they are
    ProcessedSpectrum<IMPeak>s (mz-int-mobility).
  • Adds a new MS1Spectra enum that can be either NoMobility or WithMobility
    which is passed with the search results to the LFQ code (so it filters on IM
    on top of the mz).
  • Implements the differential handling of the two variants in the FeatureMap.
  • Adds a 'Pct' (percent) variant to the Tolerance enum, since it makes a lot
    more sense for the scale of the mobility.
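The shapes described above could be sketched roughly as follows. The variant and enum names follow the PR description, but the field names, payload types, and the `bounds` helper are illustrative assumptions, not the actual sage code:

```rust
/// Plain peak: m/z and intensity (field names are illustrative).
struct Peak { mz: f32, intensity: f32 }

/// Peak with an extra ion-mobility coordinate.
struct IMPeak { mz: f32, intensity: f32, mobility: f32 }

/// MS1 spectra either carry mobility or they don't; LFQ can match on this
/// enum and additionally filter on IM when it is available.
enum MS1Spectra {
    NoMobility(Vec<Peak>),
    WithMobility(Vec<IMPeak>),
}

/// Tolerance gains a percent-based variant: mobility values live on a
/// narrow scale (roughly 0.6-1.6 1/K0), where "within N percent" is a
/// more natural window than ppm or an absolute offset.
enum Tolerance {
    Ppm(f32, f32),
    Da(f32, f32),
    Pct(f32, f32),
}

impl Tolerance {
    /// Expand a center value into a (lo, hi) window.
    fn bounds(&self, center: f32) -> (f32, f32) {
        match *self {
            Tolerance::Ppm(lo, hi) => (center * (1.0 + lo * 1e-6), center * (1.0 + hi * 1e-6)),
            Tolerance::Da(lo, hi) => (center + lo, center + hi),
            Tolerance::Pct(lo, hi) => (center * (1.0 + lo / 100.0), center * (1.0 + hi / 100.0)),
        }
    }
}

fn main() {
    // A +/- 3% window around a mobility of 1.0 1/K0:
    let (lo, hi) = Tolerance::Pct(-3.0, 3.0).bounds(1.0);
    assert!((lo - 0.97).abs() < 1e-6 && (hi - 1.03).abs() < 1e-6);
}
```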

The bulk of the changes live in:

  • crates/sage-cloudpath/src/tdf.rs
  • crates/sage/src/lfq.rs
  • crates/sage/src/spectrum.rs

Why did you design your solution this way? Did you assess any alternatives? Are there tradeoffs?

  • I believe this greatly improves the quant quality because it removes noise from irrelevant
    sections of the mobility range (which makes this PR supersede the previous attempt).
  • Tradeoffs:
    • It adds one f32 to the size of the PrecursorRange and one optional Vec to the size
      of the RawSpectrum (which I feel should be fine). In theory the RawSpectrum
      could be rewritten as a generic that gets propagated throughout the program,
      which would make the change 'zero-cost', but it felt really hard to do (it is
      a lot of re-writing plus added complexity).
  • Right now the mobility range used for centroiding (but not LFQ) is hard-coded; we
    could add it as a parameter to the config and propagate it throughout the program.
    I am not sure where this would go in the input file (either close to min_peak
    or within the bruker processor).
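The size tradeoff above can be made concrete. This is a hypothetical sketch (the struct names and fields are illustrative, not the actual sage types) contrasting the chosen design with the generic "zero-cost" alternative; one point in favor of the chosen design is that `Option<Vec<f32>>` costs no more than `Vec<f32>` itself, thanks to Rust's niche optimization:

```rust
/// Chosen design: one extra optional vector on the concrete struct.
struct RawSpectrumChosen {
    mz: Vec<f32>,
    intensity: Vec<f32>,
    /// None for data without ion mobility.
    mobility: Option<Vec<f32>>,
}

/// Alternative: make the peak type a generic parameter. No-mobility data
/// then pays nothing, but every consumer of the spectrum type must become
/// generic too, which ripples through the whole program.
struct RawSpectrumGeneric<P> {
    peaks: Vec<P>,
}

struct Peak { mz: f32, intensity: f32 }

fn main() {
    // Niche optimization: Option<Vec<f32>> is the same size as Vec<f32>,
    // so a None mobility field costs one vector header, nothing more.
    assert_eq!(
        std::mem::size_of::<Option<Vec<f32>>>(),
        std::mem::size_of::<Vec<f32>>()
    );
    // On a 64-bit target the chosen struct is three 24-byte vector headers.
    assert_eq!(std::mem::size_of::<RawSpectrumChosen>(), 72);
    // The generic alternative is a single vector header.
    assert_eq!(std::mem::size_of::<RawSpectrumGeneric<Peak>>(), 24);
}
```

The sizes assume a 64-bit target; the point is that the per-spectrum overhead of the optional field is a fixed header, independent of whether mobility data is present.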

Speedup on the generation of databases when large number of peptides are redundant.

This solution drops the library generation time from 45-50 seconds to 12-16 seconds on a particularly redundant database on my laptop.

What are the key elements of this solution?

There are three main changes:

  1. Changing the instances of Arc<String> to Arc<str> (which were used to track the
    protein accessions). This speeds up comparison of the sequences, since the Arc
    pointer goes directly to the string data instead of going through a String that
    points to the data (... I think), and (I think) it saves a couple of bytes on the
    heap (not in number of allocations but in final size).
  2. An additional deduplication stage at the digest level (done by introducing the
    DigestGroup struct). This dramatically speeds up building libraries where many
    proteins share peptides (isoform fasta files, for instance).
  3. Using the monoisotopic mass when comparing peptides for deduplication.
    Masses were already being used for sorting but not for deduplication.
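A small sketch of why `Arc<str>` shaves an indirection compared to `Arc<String>`: the `Arc<String>` points to a heap block holding a `String` header, which in turn points to the bytes, while `Arc<str>` points straight at the bytes (the length rides along in the fat pointer). Sizes below assume a 64-bit target:

```rust
use std::sync::Arc;

fn main() {
    // Arc<String>: Arc -> heap block { String header } -> heap bytes.
    let a: Arc<String> = Arc::new(String::from("sp|P12345|EXAMPLE"));
    // Arc<str>: Arc (fat pointer, carries the length) -> heap bytes.
    let b: Arc<str> = Arc::from("sp|P12345|EXAMPLE");

    // Same text, one fewer pointer hop for b when comparing:
    assert_eq!(a.as_str(), &*b);

    // The fat pointer makes Arc<str> 16 bytes on the stack vs 8 for
    // Arc<String>, but the intermediate 24-byte String header is gone
    // from the heap.
    assert_eq!(std::mem::size_of::<Arc<str>>(), 16);
    assert_eq!(std::mem::size_of::<Arc<String>>(), 8);
}
```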

Why did you design your solution this way? Did you assess any alternatives? Are there tradeoffs?

  • In theory this will be a bit slower on libraries that have many single-peptide-unique
    proteins, where nothing needs to be deduplicated, but I don't see it being a big deal
    in practice.

Extras

Does this PR require a change to the docs?

  • No.
  • Yes, and I have opened a PR at ...
  • Yes; I haven't opened a PR, but the gist of the change is:
    • New parameter for the IM tolerance in the LFQ section of the docs.
    • I also noticed that none of the config for the bruker processing is documented ...

Did you add or update tests for this change?

  • Yes. The only changes in tests were to change Arc::new(String) to Arc::from(String).
  • No, I believe new tests aren't necessary: no tests were added, but the changes are
    covered by the sage_core::database::digestion test.

Please complete the following checklist:

  • I have added an entry to CHANGELOG.md, under the [Unreleased] section heading. That entry references the issue closed by this PR.
  • I acknowledge Sage's license. I do not own my contribution.

NOTE: template for the PR is here: https://raw.githubusercontent.com/tconbeer/harlequin/refs/heads/main/.github/PULL_REQUEST_TEMPLATE.md

Extras:

  • I am not the biggest fan of how SpectrumReaderConfig is implemented right now ... I think we should re-write it as a config within sage (which implements Into for the timsrust-native ones) ... also, this is not documented anywhere ... (which I guess is fine because it's pretty experimental ...)
  • @sander-willems-bruker How hard would it be to implement a way to get the Metadata struct from a reader? Right now I am doing this in a very hacky way (looking for the analysis.tdf, but I am not sure if this would work with the mini-tdf or the other variants supported by timsrust).

@@ -19,7 +19,34 @@ impl FromParallelIterator<SageResults> for SageResults {
.reduce(SageResults::default, |mut acc, x| {
acc.features.extend(x.features);
acc.quant.extend(x.quant);
acc.ms1.extend(x.ms1);
match (acc.ms1, x.ms1) {
jspaezp (Contributor) commented on this diff:
I think we can make this a "method" for the enum instead of being here, and repeated with the serial implementation.
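The suggestion above could look something like the following. The `Empty` default variant, the payload types, and the panic on mixed inputs are my assumptions for illustration, not the PR's actual code; the point is that both the parallel and serial reducers would call the same method:

```rust
#[derive(Debug, PartialEq)]
enum MS1Spectra {
    /// Assumed default state before any batch has been merged in.
    Empty,
    /// Payloads are simplified placeholders for the real spectrum types.
    NoMobility(Vec<f32>),
    WithMobility(Vec<(f32, f32)>),
}

impl MS1Spectra {
    /// Merge another batch of MS1 spectra into this one, so the match
    /// lives in one place instead of being repeated in both reducers.
    fn extend(&mut self, other: MS1Spectra) {
        match (self, other) {
            (MS1Spectra::NoMobility(a), MS1Spectra::NoMobility(b)) => a.extend(b),
            (MS1Spectra::WithMobility(a), MS1Spectra::WithMobility(b)) => a.extend(b),
            // An empty accumulator just takes over the incoming batch.
            (this @ MS1Spectra::Empty, src) => *this = src,
            (_, MS1Spectra::Empty) => {}
            _ => panic!("cannot mix mobility and non-mobility MS1 spectra"),
        }
    }
}

fn main() {
    let mut acc = MS1Spectra::Empty;
    acc.extend(MS1Spectra::NoMobility(vec![1.0]));
    acc.extend(MS1Spectra::NoMobility(vec![2.0]));
    assert_eq!(acc, MS1Spectra::NoMobility(vec![1.0, 2.0]));
}
```

With this shape, the reducer body shrinks to `acc.ms1.extend(x.ms1)` in both the parallel and serial implementations.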


for &idx in &order {
if intensity_array[idx] <= 0.0 {
// In theory ... if I set the intensity as mutable
jspaezp (Contributor) commented on this diff:
TODO: remove outdated comment ...

@jspaezp (Contributor) commented Jan 21, 2025

Oddly enough ... this also seems to make the search faster ... I am not 100% sure why ... (pre-splitting the MS2 spectra is more cache friendly? ... I don't think so, since we didn't have MS1s before ...)

>>> MASTER
[2025-01-19T23:00:20Z INFO  sage::runner] generated 412346146 fragments, 12814433 peptides in 44020ms
[2025-01-19T23:03:31Z INFO  sage::runner] - search:    168367 ms (1220 spectra/s)
[2025-01-19T23:03:37Z INFO  sage::runner] discovered 0 target MS1 peaks at 5% FDR
[2025-01-19T23:03:37Z INFO  sage::runner] discovered 102338 target peptide-spectrum matches at 1% FDR
[2025-01-19T23:03:37Z INFO  sage::runner] discovered 41553 target peptides at 1% FDR
[2025-01-19T23:03:37Z INFO  sage::runner] discovered 17550 target proteins at 1% FDR
[2025-01-19T23:03:37Z INFO  sage::runner] finished in 240s
+bench.bash:27> echo '>>> LFQ'
>>> LFQ
[2025-01-19T23:08:10Z INFO  sage::runner] generated 412346146 fragments, 12814433 peptides in 16.701664708s
[2025-01-19T23:10:51Z INFO  sage::runner] - search:    111926 ms (1836 spectra/s)
[2025-01-19T23:10:58Z INFO  sage::runner] discovered 25156 target MS1 peaks at 5% FDR
[2025-01-19T23:10:58Z INFO  sage::runner] discovered 102338 target peptide-spectrum matches at 1% FDR
[2025-01-19T23:10:58Z INFO  sage::runner] discovered 41553 target peptides at 1% FDR
[2025-01-19T23:10:58Z INFO  sage::runner] discovered 17550 target proteins at 1% FDR
[2025-01-19T23:10:58Z INFO  sage::runner] finished in 184s

This is the script I used to benchmark (not a super strict benchmark):

set -x
set -e

git checkout master
cargo b --release --bin sage
cp target/release/sage sage_master
git checkout feat/ims_lfq
cargo b --release --bin sage
cp target/release/sage sage_lfq

FASTA_USE="Downloads/allcoonsequences2024.fasta"
FILE_USE="Downloads/09272020_PladB-4hr-5_Slot1-67_1_2098.d"

for i in {1..5}; do
  echo $i
  echo ">>> MASTER"
  SAGE_LOG=trace ./sage_master --fasta $FASTA_USE tdf_sageconfig_dda.json $FILE_USE
  echo "Done"
  echo ">>> LFQ"
  SAGE_LOG=trace ./sage_lfq --fasta $FASTA_USE tdf_sageconfig_dda.json $FILE_USE
  echo "Done"
done

and then the logs are filtered with grep -E "MASTER|BRANCH|>>>|search:|(^\d)|FDR|generated" (showing the results of only the last iteration, but all iterations look pretty similar).

The data is from @bsphinney; I am not sure I can make it public, but I can DM you the link to it.
