Combine mokapot scores from multiple files #97

wfondrie · 2023-04-06T17:48:07Z

One hindrance to large-scale analyses across many runs is that the PSMs from all of the runs must be concatenated and read into memory at the same time. This creates memory and compute bottlenecks. The subset_max_train parameter helps, but it currently alleviates only the model-training compute bottleneck in mokapot.

Instead, it should be possible to run the full mokapot algorithm on each run individually, then aggregate the new scores for FDR estimation. Each run would be re-scored using its own cross-validated models. Notably, these scores are already calibrated so that the cross-validated predictions can be combined for FDR estimation. We could then combine only the scores and spectrum identifiers in a separate FDR estimation step, massively reducing the required memory.

This would also be nice because the per-run compute bottleneck could be trivially parallelized!
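
To make the proposed separate FDR estimation step concrete, here is a minimal sketch in plain Python. It is not mokapot's implementation or API: the function name `combine_and_estimate_qvalues` and the `(spectrum_id, score, is_target)` record layout are hypothetical, and the FDR estimator `(decoys + 1) / targets` is one common conservative target-decoy choice that I am assuming for illustration. It also assumes higher scores are better and does not treat tied scores specially.

```python
from itertools import chain


def combine_and_estimate_qvalues(per_run_results):
    """Pool per-run scores and estimate q-values by target-decoy counting.

    per_run_results: iterable of lists, one list per run, holding
    (spectrum_id, score, is_target) tuples (hypothetical layout).
    Only these three small fields are needed, not the full feature table.
    """
    # Pool every run and rank from best to worst score.
    combined = sorted(
        chain.from_iterable(per_run_results),
        key=lambda record: record[1],
        reverse=True,
    )

    # Estimated FDR at each rank: (decoys + 1) / targets, capped at 1.
    fdrs = []
    targets = decoys = 0
    for _, _, is_target in combined:
        if is_target:
            targets += 1
        else:
            decoys += 1
        fdrs.append(min(1.0, (decoys + 1) / max(targets, 1)))

    # q-value: the smallest FDR at this rank or any lower-scoring rank.
    qvalues = [0.0] * len(combined)
    running_min = 1.0
    for i in range(len(combined) - 1, -1, -1):
        running_min = min(running_min, fdrs[i])
        qvalues[i] = running_min

    return [
        (spectrum_id, score, is_target, q)
        for (spectrum_id, score, is_target), q in zip(combined, qvalues)
    ]
```

Each per-run list could be produced by an independent mokapot job, so the expensive re-scoring can run in parallel and only the small score records need to be pooled here.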

wfondrie added the enhancement New feature or request label Apr 6, 2023

wfondrie self-assigned this Apr 6, 2023
