
Feat/ims lfq #166

Open: wants to merge 10 commits into master
Conversation

@jspaezp (Contributor) commented Jan 19, 2025

This PR actually contains two features that are not really related to one another (they ended up combined because of skill issues using git worktrees … let me know if you really want me to split them into separate PRs).

Initial support for LFQ on data with ion mobility.

This supersedes: #156

What are the key elements of this solution?

  • Adds parsing/reading of the IM-containing spectra from tdf files.
    • Storing all peaks was very slow, so I also added a way to do some
      centroiding on the data, which is very rudimentary right now but
      seems to work well.
  • Splits the ProcessedSpectrum struct into two variants, one where the elements
    are ProcessedSpectrum<Peak>s (mz-int) and one where they are
    ProcessedSpectrum<IMPeak>s (mz-int-mobility).
  • Adds a new MS1Spectra enum that can be either NoMobility or WithMobility
    which is passed with the search results to the LFQ code (so it filters on IM
    on top of the mz).
  • Implements the differential handling of the two variants in the FeatureMap.
  • Adds a 'Pct' (percent) variant to the Tolerance enum, since it makes a lot
    more sense for the scale of the mobility.
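The shapes described above could be sketched roughly as follows. The variant and enum names follow the PR description, but the field names, payload types, and the `bounds` helper are illustrative assumptions, not the actual sage code:

```rust
/// Plain peak: m/z and intensity (field names are illustrative).
struct Peak { mz: f32, intensity: f32 }

/// Peak with an extra ion-mobility coordinate.
struct IMPeak { mz: f32, intensity: f32, mobility: f32 }

/// MS1 spectra either carry mobility or they don't; LFQ can match on this
/// enum and additionally filter on IM when it is available.
enum MS1Spectra {
    NoMobility(Vec<Peak>),
    WithMobility(Vec<IMPeak>),
}

/// Tolerance gains a percent-based variant: mobility values live on a
/// narrow scale (roughly 0.6-1.6 1/K0), where "within N percent" is a
/// more natural window than ppm or an absolute offset.
enum Tolerance {
    Ppm(f32, f32),
    Da(f32, f32),
    Pct(f32, f32),
}

impl Tolerance {
    /// Expand a center value into a (lo, hi) window.
    fn bounds(&self, center: f32) -> (f32, f32) {
        match *self {
            Tolerance::Ppm(lo, hi) => (center * (1.0 + lo * 1e-6), center * (1.0 + hi * 1e-6)),
            Tolerance::Da(lo, hi) => (center + lo, center + hi),
            Tolerance::Pct(lo, hi) => (center * (1.0 + lo / 100.0), center * (1.0 + hi / 100.0)),
        }
    }
}

fn main() {
    // A +/- 3% window around a mobility of 1.0 1/K0:
    let (lo, hi) = Tolerance::Pct(-3.0, 3.0).bounds(1.0);
    assert!((lo - 0.97).abs() < 1e-6 && (hi - 1.03).abs() < 1e-6);
}
```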

The bulk of the changes live in:

  • crates/sage-cloudpath/src/tdf.rs
  • crates/sage/src/lfq.rs
  • crates/sage/src/spectrum.rs

Why did you design your solution this way? Did you assess any alternatives? Are there tradeoffs?

  • I believe this greatly improves the quant quality because it removes noise from irrelevant
    sections of the mobility range (which makes this PR supersede the previous attempt).
  • Tradeoffs:
    • It adds one f32 to the size of the PrecursorRange and one optional Vec to the size
      of the RawSpectrum (which I feel should be fine). In theory the RawSpectrum
      could be rewritten as a generic that gets propagated throughout the program,
      which would make the change 'zero-cost', but it felt really hard to do (it is
      a lot of re-writing plus added complexity).
  • Right now the mobility range used for centroiding (but not LFQ) is hard-coded; we
    could add it as a parameter to the config and propagate it throughout the program.
    I am not sure where this would go in the input file (either close to min_peak
    or within the bruker processor).
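The size tradeoff above can be made concrete. This is a hypothetical sketch (the struct names and fields are illustrative, not the actual sage types) contrasting the chosen design with the generic "zero-cost" alternative; one point in favor of the chosen design is that `Option<Vec<f32>>` costs no more than `Vec<f32>` itself, thanks to Rust's niche optimization:

```rust
/// Chosen design: one extra optional vector on the concrete struct.
struct RawSpectrumChosen {
    mz: Vec<f32>,
    intensity: Vec<f32>,
    /// None for data without ion mobility.
    mobility: Option<Vec<f32>>,
}

/// Alternative: make the peak type a generic parameter. No-mobility data
/// then pays nothing, but every consumer of the spectrum type must become
/// generic too, which ripples through the whole program.
struct RawSpectrumGeneric<P> {
    peaks: Vec<P>,
}

struct Peak { mz: f32, intensity: f32 }

fn main() {
    // Niche optimization: Option<Vec<f32>> is the same size as Vec<f32>,
    // so a None mobility field costs one vector header, nothing more.
    assert_eq!(
        std::mem::size_of::<Option<Vec<f32>>>(),
        std::mem::size_of::<Vec<f32>>()
    );
    // On a 64-bit target the chosen struct is three 24-byte vector headers.
    assert_eq!(std::mem::size_of::<RawSpectrumChosen>(), 72);
    // The generic alternative is a single vector header.
    assert_eq!(std::mem::size_of::<RawSpectrumGeneric<Peak>>(), 24);
}
```

The sizes assume a 64-bit target; the point is that the per-spectrum overhead of the optional field is a fixed header, independent of whether mobility data is present.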

Speedup on the generation of databases when large number of peptides are redundant.

This solution drops the library generation time from 45-50 seconds to 12-16 seconds on a particularly redundant database on my laptop.

What are the key elements of this solution?

There are three main changes:

  1. Changing the instances of Arc<String> to Arc<str> (which were used to track the
    protein accessions). This speeds up comparison of the sequences, since the Arc
    pointer goes directly to the string data instead of going through a String that
    points to the data (... I think), and (I think) it saves a couple of bytes on the
    heap (not in number of allocations but in final size).
  2. An additional deduplication stage at the digest level (done by introducing the
    DigestGroup struct). This dramatically speeds up building libraries where many
    proteins share peptides (isoform fasta files, for instance).
  3. Using the monoisotopic mass when comparing peptides for deduplication.
    Masses were already being used for sorting but not for deduplication.
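A small sketch of why `Arc<str>` shaves an indirection compared to `Arc<String>`: the `Arc<String>` points to a heap block holding a `String` header, which in turn points to the bytes, while `Arc<str>` points straight at the bytes (the length rides along in the fat pointer). Sizes below assume a 64-bit target:

```rust
use std::sync::Arc;

fn main() {
    // Arc<String>: Arc -> heap block { String header } -> heap bytes.
    let a: Arc<String> = Arc::new(String::from("sp|P12345|EXAMPLE"));
    // Arc<str>: Arc (fat pointer, carries the length) -> heap bytes.
    let b: Arc<str> = Arc::from("sp|P12345|EXAMPLE");

    // Same text, one fewer pointer hop for b when comparing:
    assert_eq!(a.as_str(), &*b);

    // The fat pointer makes Arc<str> 16 bytes on the stack vs 8 for
    // Arc<String>, but the intermediate 24-byte String header is gone
    // from the heap.
    assert_eq!(std::mem::size_of::<Arc<str>>(), 16);
    assert_eq!(std::mem::size_of::<Arc<String>>(), 8);
}
```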

Why did you design your solution this way? Did you assess any alternatives? Are there tradeoffs?

  • In theory this will be a bit slower on libraries that have many single-peptide-unique
    proteins, where nothing needs to be deduplicated, but I don't see it being a big deal
    in practice.

Extras

Does this PR require a change to the docs?

  • No.
  • Yes, and I have opened a PR at ...
  • Yes; I haven't opened a PR, but the gist of the change is:
    • New parameter for the IM tolerance in the LFQ section of the docs.
    • I also noticed that none of the config for the bruker processing is documented ...

Did you add or update tests for this change?

  • Yes. The only changes in tests were to change Arc::new(String) to Arc::from(String).
  • No, I believe new tests aren't necessary: no tests were added, but the changes are
    covered by the sage_core::database::digestion test.

Please complete the following checklist:

  • I have added an entry to CHANGELOG.md, under the [Unreleased] section heading. That entry references the issue closed by this PR.
  • I acknowledge Sage's license. I do not own my contribution.

NOTE: template for the PR is here: https://raw.githubusercontent.com/tconbeer/harlequin/refs/heads/main/.github/PULL_REQUEST_TEMPLATE.md

Extras:

  • I am not the biggest fan of how SpectrumReaderConfig is implemented right now ... I think we should re-write it as a config within sage (which implements Into for the timsrust-native ones) ... also, this is not documented anywhere ... (which I guess is fine because it's pretty experimental ...)
  • @sander-willems-bruker How hard would it be to implement a way to get the Metadata struct from a reader? Right now I am doing this in a very hacky way (looking for the analysis.tdf, but I am not sure if this would work with the mini-tdf or the other variants supported by timsrust).

@@ -19,7 +19,34 @@ impl FromParallelIterator<SageResults> for SageResults {
.reduce(SageResults::default, |mut acc, x| {
acc.features.extend(x.features);
acc.quant.extend(x.quant);
acc.ms1.extend(x.ms1);
match (acc.ms1, x.ms1) {
jspaezp (Contributor) commented on this diff:
I think we can make this a "method" for the enum instead of being here, and repeated with the serial implementation.
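The suggestion above could look something like the following. The `Empty` default variant, the payload types, and the panic on mixed inputs are my assumptions for illustration, not the PR's actual code; the point is that both the parallel and serial reducers would call the same method:

```rust
#[derive(Debug, PartialEq)]
enum MS1Spectra {
    /// Assumed default state before any batch has been merged in.
    Empty,
    /// Payloads are simplified placeholders for the real spectrum types.
    NoMobility(Vec<f32>),
    WithMobility(Vec<(f32, f32)>),
}

impl MS1Spectra {
    /// Merge another batch of MS1 spectra into this one, so the match
    /// lives in one place instead of being repeated in both reducers.
    fn extend(&mut self, other: MS1Spectra) {
        match (self, other) {
            (MS1Spectra::NoMobility(a), MS1Spectra::NoMobility(b)) => a.extend(b),
            (MS1Spectra::WithMobility(a), MS1Spectra::WithMobility(b)) => a.extend(b),
            // An empty accumulator just takes over the incoming batch.
            (this @ MS1Spectra::Empty, src) => *this = src,
            (_, MS1Spectra::Empty) => {}
            _ => panic!("cannot mix mobility and non-mobility MS1 spectra"),
        }
    }
}

fn main() {
    let mut acc = MS1Spectra::Empty;
    acc.extend(MS1Spectra::NoMobility(vec![1.0]));
    acc.extend(MS1Spectra::NoMobility(vec![2.0]));
    assert_eq!(acc, MS1Spectra::NoMobility(vec![1.0, 2.0]));
}
```

With this shape, the reducer body shrinks to `acc.ms1.extend(x.ms1)` in both the parallel and serial implementations.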


for &idx in &order {
if intensity_array[idx] <= 0.0 {
// In theory ... if I set the intensity as mutable
jspaezp (Contributor) commented on this diff:
TODO: remove outdated comment ...

@jspaezp (Contributor) commented Jan 21, 2025

Oddly enough ... this also seems to make the search faster ... I am not 100% sure why ... (pre-splitting the MS2 spectra is more cache friendly? ... I don't think so, since we didn't have MS1s before ...)

>>> MASTER
[2025-01-19T23:00:20Z INFO  sage::runner] generated 412346146 fragments, 12814433 peptides in 44020ms
[2025-01-19T23:03:31Z INFO  sage::runner] - search:    168367 ms (1220 spectra/s)
[2025-01-19T23:03:37Z INFO  sage::runner] discovered 0 target MS1 peaks at 5% FDR
[2025-01-19T23:03:37Z INFO  sage::runner] discovered 102338 target peptide-spectrum matches at 1% FDR
[2025-01-19T23:03:37Z INFO  sage::runner] discovered 41553 target peptides at 1% FDR
[2025-01-19T23:03:37Z INFO  sage::runner] discovered 17550 target proteins at 1% FDR
[2025-01-19T23:03:37Z INFO  sage::runner] finished in 240s
+bench.bash:27> echo '>>> LFQ'
>>> LFQ
[2025-01-19T23:08:10Z INFO  sage::runner] generated 412346146 fragments, 12814433 peptides in 16.701664708s
[2025-01-19T23:10:51Z INFO  sage::runner] - search:    111926 ms (1836 spectra/s)
[2025-01-19T23:10:58Z INFO  sage::runner] discovered 25156 target MS1 peaks at 5% FDR
[2025-01-19T23:10:58Z INFO  sage::runner] discovered 102338 target peptide-spectrum matches at 1% FDR
[2025-01-19T23:10:58Z INFO  sage::runner] discovered 41553 target peptides at 1% FDR
[2025-01-19T23:10:58Z INFO  sage::runner] discovered 17550 target proteins at 1% FDR
[2025-01-19T23:10:58Z INFO  sage::runner] finished in 184s

This is the script I used to benchmark (not a super strict benchmark):

set -x
set -e

git checkout master
cargo b --release --bin sage
cp target/release/sage sage_master
git checkout feat/ims_lfq
cargo b --release --bin sage
cp target/release/sage sage_lfq

FASTA_USE="Downloads/allcoonsequences2024.fasta"
FILE_USE="Downloads/09272020_PladB-4hr-5_Slot1-67_1_2098.d"

for i in {1..5}; do
  echo $i
  echo ">>> MASTER"
  SAGE_LOG=trace ./sage_master --fasta $FASTA_USE tdf_sageconfig_dda.json $FILE_USE
  echo "Done"
  echo ">>> LFQ"
  SAGE_LOG=trace ./sage_lfq --fasta $FASTA_USE tdf_sageconfig_dda.json $FILE_USE
  echo "Done"
done

and then the logs are filtered with grep -E "MASTER|BRANCH|>>>|search:|(^\d)|FDR|generated" (showing the results of only the last iteration, but all iterations look pretty similar).

The data is from @bsphinney; I am not sure I can make it public, but I can DM you the link to it.
