Implement spectra extraction workflow using pyTDFSDK with parallelization #26

Open
wants to merge 16 commits into main

Conversation

alex-l-kong
Contributor

What is the purpose of this PR?

Using SCiLS to extract spectra is a long and cumbersome process, and we have no control over its development. To expedite this, we implement our own extraction functionality in Python.

How did you implement your changes?

We initially looked into timsconvert as a solution, but there were a few problems with this approach:

  1. Poor support for multiple runs
  2. Lack of support for normalization techniques (e.g., TIC)
  3. Inconsistent .imzML file definitions
  4. Sub-optimal pyimzml memory usage

To address the issues with SCiLS and timsconvert, we leverage the dask library, which offers several benefits:

  1. Parallel computation/latency reduction: using the map_partitions function, code can run in parallel across several blocks of a DataFrame. This significantly reduces the time required to extract each spectrum, which would otherwise happen one spot at a time. On a 16-core machine, we could theoretically achieve a 16x (or, with hyperthreading, up to 32x) speedup. A minimal sketch follows this list.
  2. Out-of-core computation: lazy loading allows us to keep only part of the DataFrame in memory at once.
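
To make the map_partitions idea concrete, here is a minimal sketch of the parallel pattern; the `frame` column, the `extract_spot_spectrum` stand-in, and the partition count are illustrative assumptions rather than code from this PR:

```python
import dask.dataframe as dd
import numpy as np
import pandas as pd

def extract_spot_spectrum(frame_id: int) -> np.ndarray:
    # Stand-in for the real per-spot pyTDFSDK query; returns dummy intensities.
    rng = np.random.default_rng(frame_id)
    return rng.random(5)

def extract_partition(spots: pd.DataFrame) -> pd.DataFrame:
    # Each dask worker calls this on its own block of spots, so extraction
    # runs in parallel instead of one spot at a time.
    spots = spots.copy()
    spots["tic"] = [extract_spot_spectrum(f).sum() for f in spots["frame"]]
    return spots

spots = pd.DataFrame({"frame": range(1, 101)})
ddf = dd.from_pandas(spots, npartitions=16)

# `meta` declares the output schema so dask can build the task graph lazily.
result = ddf.map_partitions(
    extract_partition, meta={"frame": "int64", "tic": "float64"}
).compute()
print(result.head())
```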

One challenge with matching SCiLS is the binning it uses to define m/z peaks, which reduces the number of m/z datapoints carried into downstream analysis. Because our workflow extracts only raw spectra, we need to do the following:

  1. Implement TIC normalization (TODO; see the sketch after this list).
  2. Reverse engineer a binning function that approximates SCiLS' peaks (sketched further below).
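
For reference, TIC normalization (item 1) typically just rescales each spot's intensities by their total ion current; a minimal sketch of what the TODO could look like:

```python
import numpy as np

def tic_normalize(intensity: np.ndarray) -> np.ndarray:
    # Divide a spot's intensities by its total ion current so spectra
    # from different spots are comparable.
    tic = intensity.sum()
    return intensity / tic if tic > 0 else intensity

print(tic_normalize(np.array([2.0, 3.0, 5.0])))  # [0.2 0.3 0.5]
```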

The binning function uses `raw_mz / 200 / 1000` to define the bin endpoints. This closely matches what SCiLS does, since the conversion is to mDa.
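
As an illustration only (not the function shipped in this PR), one way to bin a raw spectrum with an m/z-dependent window of `raw_mz / 200 / 1000` Da might look like this:

```python
import numpy as np

def bin_spectrum(raw_mz: np.ndarray, intensity: np.ndarray):
    # Merge neighboring raw m/z values that fall within an m/z-dependent
    # window of raw_mz / 200 / 1000 Da (mDa converted to Da).
    order = np.argsort(raw_mz)
    raw_mz, intensity = raw_mz[order], intensity[order]

    binned_mz, binned_intensity = [], []
    i = 0
    while i < len(raw_mz):
        width = raw_mz[i] / 200 / 1000  # bin endpoint offset in Da
        j = np.searchsorted(raw_mz, raw_mz[i] + width, side="right")
        binned_mz.append(raw_mz[i:j].mean())
        binned_intensity.append(intensity[i:j].sum())
        i = j
    return np.array(binned_mz), np.array(binned_intensity)

mz = np.array([799.999, 800.000, 800.0005, 900.2, 900.2002])
inten = np.array([10.0, 20.0, 5.0, 7.0, 3.0])
print(bin_spectrum(mz, inten))
```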

The pyTDFSDK library provides a simple connection/cursor workflow that allows us to easily query each spot for its corresponding spectrum.
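
As a rough illustration of that connection/cursor pattern, the sketch below uses Python's built-in sqlite3 directly against the analysis.tsf metadata database rather than pyTDFSDK's own wrappers; the path and the table/column names (MaldiFrameInfo, Frame, SpotName) are assumptions that should be checked against real Bruker data:

```python
import sqlite3
from pathlib import Path

# Hypothetical path to a Bruker .d acquisition; analysis.tsf is the SQLite
# database that sits alongside the binary spectra.
d_folder = Path("run_1.d")

conn = sqlite3.connect(d_folder / "analysis.tsf")
cursor = conn.cursor()

# Assumed schema: MaldiFrameInfo maps each spot name to its frame id.
cursor.execute("SELECT Frame, SpotName FROM MaldiFrameInfo")
for frame_id, spot_name in cursor.fetchall():
    # The intensity array for each frame would then be read through the
    # pyTDFSDK bindings (not shown), keyed on frame_id.
    print(spot_name, frame_id)

conn.close()
```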

Remaining issues

pyTDFSDK is a Windows-only package, meaning this workflow cannot be tested on macOS.

Aggregating spots across different runs is still a challenge because de-identification is needed; this remains a work in progress.

@alex-l-kong self-assigned this Oct 15, 2024
Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter Notebooks.

@alex-l-kong changed the title from "Implement workflow using pyTDFSDK and dask" to "Implement spectra extraction workflow using pyTDFSDK and dask" on Oct 15, 2024
@coveralls

Pull Request Test Coverage Report for Build 11354056037

Details

  • 0 of 67 (0.0%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-31.3%) to 65.603%

Changes Missing Coverage:
  • src/maldi_tools/load_maldi_data.py: 0 of 67 changed/added lines covered (0.0%)

Totals:
  • Change from base Build 6777233943: -31.3%
  • Covered Lines: 155
  • Relevant Lines: 223

💛 - Coveralls

@alex-l-kong changed the title from "Implement spectra extraction workflow using pyTDFSDK and dask" to "Implement spectra extraction workflow using pyTDFSDK with parallelization" on Nov 11, 2024