Implement spectra extraction workflow using pyTDFSDK with parallelization #26

Open
wants to merge 16 commits into main

Conversation

alex-l-kong
Contributor

What is the purpose of this PR?

Using SCiLS to extract spectra is a long and cumbersome process, and we have no control over its development. To expedite this, we implement our own extraction functionality in Python.

How did you implement your changes?

We initially looked into timsconvert as a solution, but there were a few problems with this approach:

  1. Poor support for multiple runs
  2. Lack of support for normalization techniques (e.g., TIC)
  3. Inconsistent .imzML file definitions
  4. Sub-optimal pyimzml memory usage

To address the issues with SCiLS and timsconvert, we leverage the dask library, which offers several benefits:

  1. Parallel computation/latency reduction: using the map_partitions function, code can run in parallel across several blocks of a DataFrame. This significantly reduces the time required to extract each spectrum, which would otherwise happen one spot at a time. On a 16-core machine, we could theoretically achieve a 16x (or, with hyperthreading, up to 32x) speedup. A minimal sketch follows this list.
  2. Out-of-core computation: lazy loading allows us to keep only part of the DataFrame in memory at once.
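
To make the map_partitions idea concrete, here is a minimal sketch of the parallel pattern; the `frame` column, the `extract_spot_spectrum` stand-in, and the partition count are illustrative assumptions rather than code from this PR:

```python
import dask.dataframe as dd
import numpy as np
import pandas as pd

def extract_spot_spectrum(frame_id: int) -> np.ndarray:
    # Stand-in for the real per-spot pyTDFSDK query; returns dummy intensities.
    rng = np.random.default_rng(frame_id)
    return rng.random(5)

def extract_partition(spots: pd.DataFrame) -> pd.DataFrame:
    # Each dask worker calls this on its own block of spots, so extraction
    # runs in parallel instead of one spot at a time.
    spots = spots.copy()
    spots["tic"] = [extract_spot_spectrum(f).sum() for f in spots["frame"]]
    return spots

spots = pd.DataFrame({"frame": range(1, 101)})
ddf = dd.from_pandas(spots, npartitions=16)

# `meta` declares the output schema so dask can build the task graph lazily.
result = ddf.map_partitions(
    extract_partition, meta={"frame": "int64", "tic": "float64"}
).compute()
print(result.head())
```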

One challenge with matching SCiLS is the binning it uses to define m/z peaks, which reduces the number of m/z datapoints carried into downstream analysis. Because our workflow extracts only raw spectra, we need to do the following:

  1. Implement TIC normalization (TODO; see the sketch after this list).
  2. Reverse engineer a binning function that approximates SCiLS' peaks (sketched further below).
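
For reference, TIC normalization (item 1) typically just rescales each spot's intensities by their total ion current; a minimal sketch of what the TODO could look like:

```python
import numpy as np

def tic_normalize(intensity: np.ndarray) -> np.ndarray:
    # Divide a spot's intensities by its total ion current so spectra
    # from different spots are comparable.
    tic = intensity.sum()
    return intensity / tic if tic > 0 else intensity

print(tic_normalize(np.array([2.0, 3.0, 5.0])))  # [0.2 0.3 0.5]
```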

The binning function uses `raw_mz / 200 / 1000` to define the bin endpoints. This closely matches what SCiLS does, since the conversion is to mDa.
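
As an illustration only (not the function shipped in this PR), one way to bin a raw spectrum with an m/z-dependent window of `raw_mz / 200 / 1000` Da might look like this:

```python
import numpy as np

def bin_spectrum(raw_mz: np.ndarray, intensity: np.ndarray):
    # Merge neighboring raw m/z values that fall within an m/z-dependent
    # window of raw_mz / 200 / 1000 Da (mDa converted to Da).
    order = np.argsort(raw_mz)
    raw_mz, intensity = raw_mz[order], intensity[order]

    binned_mz, binned_intensity = [], []
    i = 0
    while i < len(raw_mz):
        width = raw_mz[i] / 200 / 1000  # bin endpoint offset in Da
        j = np.searchsorted(raw_mz, raw_mz[i] + width, side="right")
        binned_mz.append(raw_mz[i:j].mean())
        binned_intensity.append(intensity[i:j].sum())
        i = j
    return np.array(binned_mz), np.array(binned_intensity)

mz = np.array([799.999, 800.000, 800.0005, 900.2, 900.2002])
inten = np.array([10.0, 20.0, 5.0, 7.0, 3.0])
print(bin_spectrum(mz, inten))
```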

The pyTDFSDK library provides a simple connection/cursor workflow that allows us to easily query each spot for its corresponding spectrum.
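
As a rough illustration of that connection/cursor pattern, the sketch below uses Python's built-in sqlite3 directly against the analysis.tsf metadata database rather than pyTDFSDK's own wrappers; the path and the table/column names (MaldiFrameInfo, Frame, SpotName) are assumptions that should be checked against real Bruker data:

```python
import sqlite3
from pathlib import Path

# Hypothetical path to a Bruker .d acquisition; analysis.tsf is the SQLite
# database that sits alongside the binary spectra.
d_folder = Path("run_1.d")

conn = sqlite3.connect(d_folder / "analysis.tsf")
cursor = conn.cursor()

# Assumed schema: MaldiFrameInfo maps each spot name to its frame id.
cursor.execute("SELECT Frame, SpotName FROM MaldiFrameInfo")
for frame_id, spot_name in cursor.fetchall():
    # The intensity array for each frame would then be read through the
    # pyTDFSDK bindings (not shown), keyed on frame_id.
    print(spot_name, frame_id)

conn.close()
```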

Remaining issues

pyTDFSDK is a Windows-only package, meaning this workflow cannot be tested on macOS.

Aggregating spots across different runs is still a challenge because de-identification is needed; this remains a work in progress.

@alex-l-kong self-assigned this Oct 15, 2024
Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter Notebooks.

@alex-l-kong changed the title from "Implement workflow using pyTDFSDK and dask" to "Implement spectra extraction workflow using pyTDFSDK and dask" on Oct 15, 2024
@coveralls

Pull Request Test Coverage Report for Build 11354056037

Details

  • 0 of 67 (0.0%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-31.3%) to 65.603%

Changes Missing Coverage:
  • src/maldi_tools/load_maldi_data.py: 0 of 67 changed/added lines covered (0.0%)

Totals:
  • Change from base Build 6777233943: -31.3%
  • Covered Lines: 155
  • Relevant Lines: 223

💛 - Coveralls

@alex-l-kong changed the title from "Implement spectra extraction workflow using pyTDFSDK and dask" to "Implement spectra extraction workflow using pyTDFSDK with parallelization" on Nov 11, 2024