v0.11.0 RC1 #132

jspaezp · 2024-12-05T04:00:33Z

What does this PR do:

Adds back the FlashLFQ support.
Fixes the Python API to work with the docs.
Centralizes linting/formatting expectations in a makefile.
Migrates dependency management to uv.
Adds a PR template.
Centralizes column inference and meaning logic into specific dataclasses (mokapot.column_defs).
Returns semantic meaning to the column definitions (also specified in the column_defs).
Re-adds support for compound primary keys as a unit of identification for PSMs.
Refactors a lot of the confidence logic.
Addresses:

Notes:

Most of the lines added are just this file: data/phospho_rep1.traditional.pin
It is a backport of the original testing file (phospho_rep1.pin, which was "fixed" in another PR by removing the ragged aspect of the protein column: PROT1\tPROT2 -> PROT1:PROT2).
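For readers unfamiliar with the "ragged" protein column, here is a minimal sketch of the idea, assuming the protein column is the last one in the header and that extra proteins spill into additional tab-separated fields. `join_ragged_proteins` is a hypothetical helper written for illustration; it is not mokapot's parser.

```python
def join_ragged_proteins(header_line, row_line, sep="\t", joiner=":"):
    """Collapse tab-spilled protein fields of one row into a single field.

    Assumption: the last header column holds the proteins, and any fields
    beyond the header length belong to that column.
    """
    header = header_line.rstrip("\n").split(sep)
    fields = row_line.rstrip("\n").split(sep)
    if len(fields) < len(header):
        raise ValueError("row has fewer fields than the header")
    n_fixed = len(header) - 1  # every column except the trailing protein column
    return fields[:n_fixed] + [joiner.join(fields[n_fixed:])]


# Illustrative header and row (not taken from the test file)
header = "SpecId\tLabel\tScanNr\tPeptide\tProteins"
row = "1\t1\t100\tK.PEPTIDE.R\tPROT1\tPROT2"
fixed = join_ragged_proteins(header, row)
print(fixed[-1])
```

A row with a single protein passes through unchanged, since joining one field yields that field.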

Blockers:

Windows tests fail because of numpy v1.x; updating to v2 needs triqler to bump the dependency. The PR (chore: update numpy to v2, statisticalbiotechnology/triqler#34) is accepted but not released. - edit: Released

Unhandled things:

No idea why windows is failing tests. - edit: Handled updating numpy to v2
Update docs/docstrings/vignettes - edit: partial update done

* ✨ cherry picks internal fixes from !68 and !70
* Cherry pick feature/confidence_streaming branch
* ✨ adds filelock dependency for tests
* 💄 linting
* 💄 reformat to satisfy linter k
* ✨ imports type annotations from future for python 3.9
* ✨ make pytest and cli behave with type annotations in Python 3.9
* ✨ test dropping Python 3.9 support - inspired by https://github.com/wfondrie/mokapot/pull/126/files#diff-1db27d93186e46d3b441ece35801b244db8ee144ff1405ca27a163bfe878957fL20
* Set scale_to_one to false in *all* cases
* Fixed path problems probably causing errors under windows
* Fix more possible path issues
* Fix warning about bitwise not in python 3.12
* Fix problem with numpy 2.x's different str rep of floats
* Make hashing of rows for splitting independent of numpy version and spectra columns
* Feature/streaming fix windows (wfondrie#48)
* ✨ log more infos
* ✨ uses uv for env setup; fix dependencies

---------

Co-authored-by: Elmar Zander <[email protected]>

Fixed retention time division by 60. Time is required in minutes for FlashLFQ, and it's already in minutes. Co-authored-by: William Fondrie <[email protected]>

jspaezp · 2024-12-05T19:24:17Z

mokapot/tabular_data/format_chooser.py

 )

-CSV_SUFFIXES = [".csv", ".pin", ".tab", ".csv"]
+CSV_SUFFIXES = [


For the record... I still dislike naming so many tab-delimited file formats as "comma separated values (csv)"

I absolutely agree. I just don't see a better way, as those other extensions are already out there in the wild.

I don't recall any tool that generates a tab-delimited .csv off the top of my head. Do you happen to have an example? (I won't deal with it in this PR, but in the future we could split csv-tsv formats internally.)

Sorry, you're right. I somehow misread your initial comment. Yes, since we really never have "comma-separated" values anywhere, why not get rid of it completely and replace "comma separated/CSV" with "tab separated/TSV" everywhere.

For the record: when I started on this code base, it was something with "comma separated" everywhere, but a separator variable sep was passed around, which was always set to "\t". I got rid of all the explicit file reading/writing stuff and moved that into the readers/writers, set the separator (I think) unconditionally to "\t", but did not rename the variables/classes. So: my bad ;)

To be clear, I think adding support for .csv would be a good idea in the future (comma separated file)
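As a rough sketch of what splitting the csv/tsv formats internally could look like, assuming the delimiter is chosen purely from the file suffix: `TSV_SUFFIXES`, `COMMA_SUFFIXES` and `guess_delimiter` are hypothetical names for illustration and are not part of mokapot/tabular_data/format_chooser.py.

```python
from pathlib import Path

# Hypothetical split of the suffix list: tab-delimited formats vs. true CSV.
TSV_SUFFIXES = {".tsv", ".tab", ".pin"}
COMMA_SUFFIXES = {".csv"}


def guess_delimiter(path):
    """Return the field delimiter implied by the file suffix."""
    suffix = Path(path).suffix.lower()
    if suffix in TSV_SUFFIXES:
        return "\t"
    if suffix in COMMA_SUFFIXES:
        return ","
    raise ValueError(f"unsupported tabular suffix: {suffix!r}")


print(repr(guess_delimiter("results.sage.pin")))
print(repr(guess_delimiter("targets.psms.csv")))
```

Whether existing tab-delimited .csv outputs would need a compatibility path is a separate question that this sketch does not address.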

jspaezp · 2024-12-05T21:19:02Z

Edit: a7401c3 makes some progress. I figured out the confidence part but still need to "pipe" some columns needed by FlashLFQ, since _optional_columns was removed as an attribute from the confidence object.

@gessulat and @ezander

I might need help with this one to understand how to update the documentation.

Right now if I try to do this (part of tests/unit_tests/test_writer_flashlfq.py):

# Using the psms_ondisk fixture from the tests ...
import mokapot


def test_sanity(psms_ondisk, tmp_path):
    """Run simple sanity checks"""

    mods, scores = mokapot.brew([psms_ondisk])
    conf = mokapot.assign_confidence(
        [psms_ondisk],
        scores_list=scores,
        eval_fdr=0.05,
        # Right now, deduplication=True fails with an error saying that
        # the column "ExpMass" does not exist.
        deduplication=False,
    )

# With deduplication=False it fails with `KeyError: 'proteinIds'`

So ... where are these columns specified? How can one assign confidence without proteins?

https://github.com/jspaezp/mokapot/blob/08d73afec23a072642f37ba510bc6d2a7d3577db/mokapot/confidence.py#L380-L388

https://github.com/jspaezp/mokapot/blob/08d73afec23a072642f37ba510bc6d2a7d3577db/tests/unit_tests/test_writer_flashlfq.py#L8-L19

jspaezp · 2024-12-07T03:55:11Z

Note:

There seems to be a difference in what 'OnDiskDataset' and 'LinearPsmDataset' mean by spectra:

For the on-disk PSM dataset, it is all of these:

    ...
    labels = find_required_column("label", columns)

    # Optional columns
    filename = find_optional_column(filename_column, columns, "filename")
    calcmass = find_optional_column(calcmass_column, columns, "calcmass")
    expmass = find_optional_column(expmass_column, columns, "expmass")
    ret_time = find_optional_column(rt_column, columns, "ret_time")
    charge = find_optional_column(charge_column, columns, "charge_column")
    spectra = [c for c in [filename, scan, ret_time, expmass] if c is not None]

https://github.com/jspaezp/mokapot/blob/73a0e14df017dcb0d8ba5c2ed2cfa2d17d581eab/mokapot/parsers/pin.py#L223-L232

and the linear psm defines it as

spectrum_columns : str or tuple of str
        The column(s) that collectively identify unique mass spectra. Multiple
        columns can be useful to avoid combining scans from multiple mass
        spectrometry runs.

https://github.com/jspaezp/mokapot/blob/73a0e14df017dcb0d8ba5c2ed2cfa2d17d581eab/mokapot/dataset.py#L255-L260

which would seem more like the OnDisk ... of specId_column: the linear PSM dataset uses as its index the compound index made from the columns defined by 'spectrum_columns', whilst the on-disk dataset assumes there is a single column that can be used as a primary index.
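To make the difference concrete, here is a minimal sketch in plain Python, assuming made-up rows and a hypothetical `best_per_spectrum` helper; it is not how either dataset class is implemented. It keeps the best-scoring row per key and shows how the number of distinct "spectra" depends on which columns form the key.

```python
# Illustrative rows: two runs that both contain scan number 101.
rows = [
    {"SpecId": "a_1", "FileName": "run1.d", "ScanNr": 101, "score": 3.2},
    {"SpecId": "a_2", "FileName": "run1.d", "ScanNr": 101, "score": 4.1},
    {"SpecId": "b_1", "FileName": "run2.d", "ScanNr": 101, "score": 2.7},
]


def best_per_spectrum(rows, key_columns):
    """Keep the highest-scoring row for each key built from key_columns."""
    best = {}
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        if key not in best or row["score"] > best[key]["score"]:
            best[key] = row
    return best


by_compound = best_per_spectrum(rows, ("FileName", "ScanNr"))  # compound key
by_scan_only = best_per_spectrum(rows, ("ScanNr",))  # merges scans across runs
by_spec_id = best_per_spectrum(rows, ("SpecId",))  # single primary-key column

print(len(by_compound), len(by_scan_only), len(by_spec_id))
```

With the compound key, the two runs stay separate while competing PSMs of the same scan are collapsed; a single ID column keeps every row apart, and the scan number alone merges scans from different runs.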

jspaezp · 2025-01-31T00:04:40Z


set -x
set -e

PINFILE="$HOME/git/sage_tdf_ms1_v2/sage_results_pin/results.sage.pin"
PINFILE_IMPUTED="$HOME/git/sage_tdf_ms1_v2/sage_results_pin/results.sage.imputed.pin"
FASTA_FILE="$HOME/Downloads/allcoonsequences2024.fasta"

cat $PINFILE | sed -e "s/NaN/10.00/g" > $PINFILE_IMPUTED

/usr/bin/time -l uv run --no-config --with "numpy < 2.0" --with "pandas < 2.0" \
  --with "mokapot @ git+https://github.com/wfondrie/[email protected]" -p 3.9\
  mokapot --proteins $FASTA_FILE --dest_dir mokapot_38_release $PINFILE_IMPUTED
/usr/bin/time -l uv run --no-config --with "numpy < 2.0" --with "pandas < 2.0" \
  --with "mokapot @ git+https://github.com/wfondrie/mokapot.git@main" -p 3.10 \
  mokapot --proteins $FASTA_FILE --dest_dir mokapot_310_main $PINFILE_IMPUTED
/usr/bin/time -l uv run --no-config \
  --with "mokapot @ git+https://github.com/jspaezp/mokapot.git@feature/auto_pin_handling2" -p 3.12 \
  mokapot --proteins $FASTA_FILE --dest_dir mokapot_312_results $PINFILE_IMPUTED

tree mokapot_*

for x in peptide protein psm; do
  echo "===== $x ====="
  head -2 mokapot_*/*$x*
done

for x in peptide protein psm; do
  echo "===== $x ====="
  wc -l mokapot_*/*$x*
done

Changes in the output:

note:

mokapot_310_main = current main branch
mokapot_312_results = this PR
mokapot_38_release = current release

Original column names in the pin file

SpecId  Label   ScanNr  ExpMass CalcMass        FileName
retentiontime   ion_mobility    rank    z=2     z=3     z=4
z=5     z=6     z=other peptide_len     missed_cleavages
semi_enzymatic  isotope_error   ln(precursor_ppm)       fragment_ppm
ln(hyperscore)  ln(delta_next)  ln(delta_best)  aligned_rt
predicted_rt    sqrt(delta_rt_model)  predicted_mobility
sqrt(delta_mobility)    matched_peaks   longest_b       longest_y
longest_y_pct   ln(matched_intensity_pct)       scored_candidates
ln(-poisson)  posterior_error  Peptide Proteins

Several columns changed from release -> main and back again from
main -> this PR

===== peptide =====
==> mokapot_310_main/targets.peptides.csv <==
PSMId   peptide score   q-value posterior_error_prob    proteinIds
94362   LFGNMEGDC[+57.0215]PSDWK        4.422188798143788       5.3949072025716305e-05  1.3279592835191726e-13  ENSP00000353224;ENSP00000353224.4;ENSP00000376197;ENSP00000376197.3;ENSP00000390133;ENSP00000390133.1;GENSCAN00000015036;sp|P02786|TFR1_HUMAN;tr|G3V0E5|G3V0E5_HUMAN

==> mokapot_312_results/targets.peptides.tsv <==
SpecId  FileName        ScanNr  ExpMass peptide score   mokapot_qvalue  posterior_error_prob    proteinIds
94362   09272020_PladB-4hr-5_Slot1-67_1_2098.d  101458  1654.6742       LFGNMEGDC[+57.0215]PSDWK        4.552594172421302       5.3253807e-05   5.975924587263371e-14   ENSP00000353224;ENSP00000353224.4;ENSP00000376197;ENSP00000376197.3;ENSP00000390133;ENSP00000390133.1;GENSCAN00000015036;sp|P02786|TFR1_HUMAN;tr|G3V0E5|G3V0E5_HUMAN

==> mokapot_38_release/mokapot.peptides.txt <==
SpecId  Label   ScanNr  ExpMass CalcMass        FileName        Peptide mokapot score   mokapot q-value mokapot PEP     Proteins
94362   True    101458  1654.6742       1654.6757       09272020_PladB-4hr-5_Slot1-67_1_2098.d  LFGNMEGDC[+57.0215]PSDWK        5.660280381349133       5.621767483696874e-05   4.277013148574386e-13  ENSP00000353224;ENSP00000353224.4;ENSP00000376197;ENSP00000376197.3;ENSP00000390133;ENSP00000390133.1;GENSCAN00000015036;sp|P02786|TFR1_HUMAN;tr|G3V0E5|G3V0E5_HUMAN

Currently (this PR vs last release):

Label is not in the peptide output.
The columns are not capitalized (even though they are in the input file).
Proteins gets renamed to proteinIds
mokapot score -> score
mokapot q-value -> mokapot_qvalue
mokapot PEP -> posterior_error_prob


===== protein =====
==> mokapot_310_main/targets.proteins.csv <==
mokapot protein group   best peptide    stripped sequence       score   q-value posterior_error_prob
"ENSP00000307423        r11092:T324S, GENSCAN00000036155, ENSP00000307423.2, sp|Q10469|MGAT2_HUMAN"     IFHAGDC[+57.0215]GMHHK  IFHAGDCGMHHK    4.034602465755392       0.001497006043791771   4.657116363360999e-08

==> mokapot_312_results/targets.proteins.tsv <==
mokapot protein group   best peptide    stripped sequence       score   mokapot_qvalue  posterior_error_prob
"ENSP00000307423        r11092:T324S, GENSCAN00000036155, ENSP00000307423.2, sp|Q10469|MGAT2_HUMAN"     IFHAGDC[+57.0215]GMHHK  IFHAGDCGMHHK    4.159152488171724       0.00147929    3.15034802432272e-08

==> mokapot_38_release/mokapot.proteins.txt <==
mokapot protein group   best peptide    stripped sequence       mokapot score   mokapot q-value mokapot PEP
"ENSP00000225388        r16415:H45P, GENSCAN00000003560, ENSP00000225388.3, sp|Q7Z417|NUFP2_HUMAN"      GADNDGSGSESGYTTPK       GADNDGSGSESGYTTPK       4.347667255848529       0.0015873015873015873  2.238033114813839e-08

Currently:

mokapot q-value -> mokapot_qvalue
mokapot PEP -> posterior_error_prob

===== psm =====
==> mokapot_310_main/targets.psms.csv <==
PSMId   peptide score   q-value posterior_error_prob    proteinIds
94362   LFGNMEGDC[+57.0215]PSDWK        4.422188798143788       3.0441400667768903e-05  3.6080280142001793e-13  ENSP00000353224;ENSP00000353224.4;ENSP00000376197;ENSP00000376197.3;ENSP00000390133;ENSP00000390133.1;GENSCAN00000015036;sp|P02786|TFR1_HUMAN;tr|G3V0E5|G3V0E5_HUMAN

==> mokapot_312_results/targets.psms.tsv <==
SpecId  FileName        ScanNr  ExpMass peptide score   mokapot_qvalue  posterior_error_prob    proteinIds
94362   09272020_PladB-4hr-5_Slot1-67_1_2098.d  101458  1654.6742       LFGNMEGDC[+57.0215]PSDWK        4.552594172421302       2.9895366e-05   1.6038471672165864e-13  ENSP00000353224;ENSP00000353224.4;ENSP00000376197;ENSP00000376197.3;ENSP00000390133;ENSP00000390133.1;GENSCAN00000015036;sp|P02786|TFR1_HUMAN;tr|G3V0E5|G3V0E5_HUMAN

==> mokapot_38_release/mokapot.psms.txt <==
SpecId  Label   ScanNr  ExpMass CalcMass        FileName        Peptide mokapot score   mokapot q-value mokapot PEP     Proteins
94362   True    101458  1654.6742       1654.6757       09272020_PladB-4hr-5_Slot1-67_1_2098.d  LFGNMEGDC[+57.0215]PSDWK        5.660280381349133       3.332444681418288e-05   5.3656986367161e-14    ENSP00000353224;ENSP00000353224.4;ENSP00000376197;ENSP00000376197.3;ENSP00000390133;ENSP00000390133.1;GENSCAN00000015036;sp|P02786|TFR1_HUMAN;tr|G3V0E5|G3V0E5_HUMAN

Current:

CalcMass -> not anymore
Label -> not anymore
Peptide -> peptide
mokapot score -> score
mokapot q-value -> mokapot_qvalue
mokapot PEP -> posterior_error_prob

IMO: It feels inconsistent to have generated columns that have spaces and ones that don't;
same for the mokapot prefix on columns. I get the impetus to make stuff more SQL-friendly, but there is some consistency to work on here.
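For anyone who needs to adapt downstream scripts, the renames listed above can be written down as a small mapping. This is only a sketch built from the headers shown in this comment; `RELEASE_TO_RC`, `DROPPED_IN_RC` and `translate_header` are hypothetical names, not something mokapot provides.

```python
# Column renames observed between the last release and this PR (PSM/peptide tables).
RELEASE_TO_RC = {
    "Peptide": "peptide",
    "Proteins": "proteinIds",
    "mokapot score": "score",
    "mokapot q-value": "mokapot_qvalue",
    "mokapot PEP": "posterior_error_prob",
}
# Columns present in the release output but absent from this PR's output.
DROPPED_IN_RC = {"Label", "CalcMass"}


def translate_header(release_header):
    """Map release column names to the names used by this PR, dropping removed ones."""
    return [RELEASE_TO_RC.get(c, c) for c in release_header if c not in DROPPED_IN_RC]


release_header = [
    "SpecId", "Label", "ScanNr", "ExpMass", "CalcMass", "FileName",
    "Peptide", "mokapot score", "mokapot q-value", "mokapot PEP", "Proteins",
]
rc_header = translate_header(release_header)
print(rc_header)
```

The translated names match the set of columns in targets.psms.tsv above, although the column order in the actual output differs.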

Peptide and PSM line counts are the same! wooo! (The protein counts differ slightly between the three runs.)

===== peptide =====
   83260 mokapot_310_main/targets.peptides.csv
   83260 mokapot_312_results/targets.peptides.tsv
   83260 mokapot_38_release/mokapot.peptides.txt
  249780 total
===== protein =====
   14469 mokapot_310_main/targets.proteins.csv
   14484 mokapot_312_results/targets.proteins.tsv
   14492 mokapot_38_release/mokapot.proteins.txt
   43445 total
===== psm =====
  147855 mokapot_310_main/targets.psms.csv
  147855 mokapot_312_results/targets.psms.tsv
  147855 mokapot_38_release/mokapot.psms.txt
  443565 total

jspaezp and others added 12 commits September 6, 2024 16:48

(feat) added auto handling of traditional pin and testing

d9d91dc

(fix) added handling of default direction

31dff36

(fix) changed intermediate files pin->tsv and fixed tests accordingly

5940500

(chore) formatting on docs and removed T20 from them

a64c1f2

(chore) upgraded to upstream actions

1ff50c5

(chore) removed unused dependency in docs

1809edd

(chore) reformatted tests

3214c24

Small changes for FlashLFQ writer (wfondrie#131)

6db5964

Fixed retention time division by 60. Time is required in minutes for FlashLFQ, and it's already in minutes. Co-authored-by: William Fondrie <[email protected]>

wip: formatting and rebasing fixes

a17fa4c

chore: merge main

8c779fa

chore: ruff format

d6ef287

jspaezp mentioned this pull request Dec 5, 2024

Storey's method for q-value computation (draft) #133

Closed

wip,chore: re-adding flashlfq support

1d7475f

jspaezp commented Dec 5, 2024

jspaezp added 2 commits December 5, 2024 14:33

ci,fix: fixed confidence out and ci migration

0074f88

format: eof newline

08d73af

jspaezp mentioned this pull request Dec 5, 2024

(feat) added auto handling of traditional pin and testing #126

Closed

jspaezp added 5 commits December 5, 2024 17:05

wip,fix: progress to re-add flashlfq output

a7401c3

chore: uv lock and formatting

35cb9d8

chore: added pr template

ce53dee

wip: make brew generic again

d6f58ac

wip,fix: added deleter to on psm dataset

73a0e14

jspaezp added 5 commits December 7, 2024 17:08

feat: re-added flashlfq support

f7e8dbd

chore: linting + formatting

6045ade

fix: fixtures and progess in definition of cols

b02567a

test, fix: annotated/commented new fixtures

b06b01e

lint: formatting

26c4c78

jspaezp and others added 13 commits December 16, 2024 15:35

feat(confidence): add data reading api

59e649d

feat,experiment: Experimental qvalue-fdr estimation

2e43ce2

chore,docs: updated basic docs to curr api and updated typing

be91528

chore: updated basic n joint model docs code (md in progress)

680fc5b

chore: updated notebook

4ece548

chore,confidence: update docstrings

409d98d

chore,qvalue: removed commented out code

100ec58

Merge pull request #2 from jspaezp/chore/fix_confidence_api

c246df7

Chore/fix confidence api

chore: fixed line length lints in docstrings

1e7d68c

fix,sqlite: fixed path for sqlite writer

a10cb31

feat, wip: compound key on spectrum

a8df8e4

refactor,wip: centralized column group logic

614971f

refactor,dataset: broke module into files and changed tdc implementat…

1ec39df

…ion to remove numba

jspaezp changed the title ~~[WIP] Feature/auto pin handling2~~ [WIP] v0.11.0 RC Jan 2, 2025

jspaezp added 3 commits January 2, 2025 11:11

fix: fixed string to bool target col conversion and added notes on tests

44bf2e2

ci: enabled lint and test on all PRs

10b9f02

chore: updated triqler and np versions

8b5a736

jspaezp mentioned this pull request Jan 8, 2025

feat, wip: compound key on spectrum jspaezp/mokapot#3

Merged

jspaezp and others added 5 commits January 28, 2025 14:55

feat,doc: fixed empty cols in proteins and better column descriptions…

799661f

… in docstrings

test: added content testing to cli testing + csv -> tsv

012ca0e

fix: flashlfq and misc fixes

eca38a6

ci: added extra xml to test makefile

27d553a

Merge pull request #3 from jspaezp/feat/re_add_compound_index_spec

1dadad3

feat, wip: compound key on spectrum

jspaezp and others added 5 commits February 6, 2025 01:19

chore: unify naming schemas

3407d9a

chore: self-review cleanup

1d10907

fix,chore: updated makefile and fixed iterator

ac0aef7

chore: self-review cleanup

d4b4e42

Merge pull request #4 from jspaezp/chore/consolidate_names

121666f

chore: unify naming schemas

jspaezp changed the title ~~[WIP] v0.11.0 RC~~ v0.11.0 RC1 Feb 7, 2025
