v0.11.0 RC1 #132

Open
wants to merge 60 commits into base: main

Collaborator

@jspaezp jspaezp commented Dec 5, 2024

What does this pr do:

  1. Adds back the FlashLFQ support.
  2. Fixes the Python API to work with the docs.
  3. Centralizes linting/formatting expectations in a Makefile.
  4. Migrates dependency management to uv.
  5. Adds a PR template.
  6. Centralizes column inference and meaning logic into specific dataclasses (mokapot.column_defs).
  7. Returns semantic meaning to the column definitions (also specified in column_defs).
  8. Re-adds support for compound primary keys as a unit of identification for PSMs.
  9. Refactors a lot of the confidence logic.
    Addresses:

Notes:

Most of the added lines are in a single file: data/phospho_rep1.traditional.pin
It is a backport of the original testing file (phospho_rep1.pin, which was "fixed" in another PR by removing the ragged aspect of the protein column: PROT1\tPROT2 -> PROT1:PROT2).

Blockers:

Unhandled things:

  1. No idea why Windows is failing tests. Edit: handled by updating numpy to v2.
  2. Update docs/docstrings/vignettes. Edit: partial update done.

jspaezp and others added 12 commits September 6, 2024 16:48
* ✨ cherry picks internal fixes from !68 and !70

* Cherry pick feature/confidence_streaming branch

* ✨ adds filelock dependency for tests

* 💄 linting

* 💄 reformat to satisfy linter
k

* ✨ imports type annotations from future for python 3.9

* ✨ make pytest and cli behave with type annotations in Python 3.9

* ✨ test dropping Python 3.9 support

- inspired by
  https://github.com/wfondrie/mokapot/pull/126/files#diff-1db27d93186e46d3b441ece35801b244db8ee144ff1405ca27a163bfe878957fL20

* Set scale_to_one to false in *all* cases

* Fixed path problems probably causing errors under windows

* Fix more possible path issues

* Fix warning about bitwise not in python 3.12

* Fix problem with numpy 2.x's different str rep of floats

* Make hashing of rows for splitting independent of numpy version and spectra columns

* Feature/streaming fix windows (wfondrie#48)

* ✨ log more infos
* ✨ uses uv for env setup; fix dependencies

---------

Co-authored-by: Elmar Zander <[email protected]>
Fixed retention time division by 60.
Time is required in minutes for FlashLFQ, it's already in minutes

Co-authored-by: William Fondrie <[email protected]>
)

CSV_SUFFIXES = [".csv", ".pin", ".tab", ".csv"]
CSV_SUFFIXES = [
Collaborator Author

For the record... I still dislike naming so many tab-delimited file formats as "comma separated values (csv)"

Contributor

I absolutely agree. I just don't see a better way, as those other extensions are already out there in the wild.

Collaborator Author

I don't recall any tool that generates a tab-delimited .csv off the top of my head. Do you happen to have an example? (I won't deal with it in this PR, but in the future we could split the csv and tsv formats internally.)

Contributor

Sorry, you're right. I somehow misread your initial comment. Yes, since we never really have "comma-separated" values anywhere, why not get rid of it completely and replace "comma separated/CSV" with "tab separated/TSV" everywhere?

For the record: when I started on this code base, it was something with "comma separated" everywhere, but a separator variable sep was passed around, which was always set to "\t". I got rid of all the explicit file reading/writing stuff and moved that into the readers/writers, set the separator (I think) unconditionally to "\t", but did not rename the variables/classes. So: my bad ;)

Collaborator Author

To be clear, I think adding support for .csv would be a good idea in the future (comma separated file)
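
To make the csv/tsv split discussed in this thread concrete, here is a minimal sketch of choosing the delimiter from the file suffix. It is not mokapot's current behaviour (per the comments above, the separator is currently always "\t"); `DELIMITER_BY_SUFFIX` and `delimiter_for` are hypothetical names used only for illustration.

```python
from pathlib import Path

# Hypothetical mapping, NOT mokapot's actual CSV_SUFFIXES handling.
# Tab-delimited formats and true comma-separated files are kept apart.
DELIMITER_BY_SUFFIX = {
    ".tsv": "\t",
    ".tab": "\t",
    ".pin": "\t",
    ".csv": ",",
}


def delimiter_for(path):
    """Return the field delimiter implied by the file suffix."""
    suffix = Path(path).suffix.lower()
    try:
        return DELIMITER_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"Unsupported tabular suffix: {suffix!r}") from None
```

With a mapping like this, `delimiter_for("results.sage.pin")` yields a tab while `delimiter_for("targets.psms.csv")` yields a comma, so a future comma-separated reader would not need a separate code path.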

@jspaezp
Collaborator Author

jspaezp commented Dec 5, 2024

Edit: a7401c3 makes some progress. I figured out the confidence part but still need to "pipe" through some columns needed by FlashLFQ, since _optional_columns was removed as an attribute from the confidence object.

@gessulat and @ezander

I might need help with this one to understand how to update the documentation.

Right now if I try to do this (part of tests/unit_tests/test_writer_flashlfq.py):

# Using the psms_ondisk fixture from your tests ...
def test_sanity(psms_ondisk, tmp_path):
    """Run simple sanity checks"""

    mods, scores = mokapot.brew([psms_ondisk])
    conf = mokapot.assign_confidence(
        [psms_ondisk],
        scores_list=scores,
        eval_fdr=0.05,
        deduplication=False,  # Currently fails with deduplication=True: the error says the column "ExpMass" does not exist
    )

# With deduplication=False it fails with `KeyError: 'proteinIds'`

So... where are these columns specified? How can one assign confidence without proteins?

https://github.com/jspaezp/mokapot/blob/08d73afec23a072642f37ba510bc6d2a7d3577db/mokapot/confidence.py#L380-L388

https://github.com/jspaezp/mokapot/blob/08d73afec23a072642f37ba510bc6d2a7d3577db/tests/unit_tests/test_writer_flashlfq.py#L8-L19

@jspaezp
Collaborator Author

jspaezp commented Dec 7, 2024

Note:

There seems to be a difference in what 'OnDiskDataset' and 'LinearPsmDataset' mean by spectra:

For the on-disk PSM dataset it is all of these:

    ...
    labels = find_required_column("label", columns)

    # Optional columns
    filename = find_optional_column(filename_column, columns, "filename")
    calcmass = find_optional_column(calcmass_column, columns, "calcmass")
    expmass = find_optional_column(expmass_column, columns, "expmass")
    ret_time = find_optional_column(rt_column, columns, "ret_time")
    charge = find_optional_column(charge_column, columns, "charge_column")
    spectra = [c for c in [filename, scan, ret_time, expmass] if c is not None]

https://github.com/jspaezp/mokapot/blob/73a0e14df017dcb0d8ba5c2ed2cfa2d17d581eab/mokapot/parsers/pin.py#L223-L232

and the linear psm defines it as

spectrum_columns : str or tuple of str
        The column(s) that collectively identify unique mass spectra. Multiple
        columns can be useful to avoid combining scans from multiple mass
        spectrometry runs.

https://github.com/jspaezp/mokapot/blob/73a0e14df017dcb0d8ba5c2ed2cfa2d17d581eab/mokapot/dataset.py#L255-L260

which seems closer to the OnDisk ... of specId_column: the linear PSM dataset uses as its index the compound index built from the columns defined by 'spectrum_columns', whilst the on-disk dataset assumes there is a single column that can be used as a primary index.
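
The difference between a single-column index and a compound key can be illustrated with plain Python. This is only a sketch with made-up rows; `compound_key` and `group_by_spectrum` are hypothetical helpers, not functions from mokapot.

```python
from collections import defaultdict


def compound_key(row, spectrum_columns):
    """Build a hashable key from one or several columns of a row."""
    if isinstance(spectrum_columns, str):
        spectrum_columns = (spectrum_columns,)
    return tuple(row[col] for col in spectrum_columns)


def group_by_spectrum(rows, spectrum_columns):
    """Group PSM rows that share the same (compound) spectrum key."""
    groups = defaultdict(list)
    for row in rows:
        groups[compound_key(row, spectrum_columns)].append(row)
    return dict(groups)


# Made-up rows: the same scan number appears in two different runs.
rows = [
    {"FileName": "run_a", "ScanNr": 1, "Peptide": "PEPTIDEK"},
    {"FileName": "run_b", "ScanNr": 1, "Peptide": "OTHERK"},
    {"FileName": "run_a", "ScanNr": 1, "Peptide": "PEPTIDER"},
]

by_scan = group_by_spectrum(rows, "ScanNr")
by_file_and_scan = group_by_spectrum(rows, ("FileName", "ScanNr"))
```

Keying on "ScanNr" alone merges scans from both runs into one group, whereas the ("FileName", "ScanNr") key keeps them apart, which is the situation the `spectrum_columns` docstring describes.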

@jspaezp jspaezp changed the title [WIP] Feature/auto pin handling2 [WIP] v0.11.0 RC Jan 2, 2025
@jspaezp
Collaborator Author

jspaezp commented Jan 31, 2025


set -x
set -e

PINFILE="$HOME/git/sage_tdf_ms1_v2/sage_results_pin/results.sage.pin"
PINFILE_IMPUTED="$HOME/git/sage_tdf_ms1_v2/sage_results_pin/results.sage.imputed.pin"
FASTA_FILE="$HOME/Downloads/allcoonsequences2024.fasta"

sed -e "s/NaN/10.00/g" "$PINFILE" > "$PINFILE_IMPUTED"

/usr/bin/time -l uv run --no-config --with "numpy < 2.0" --with "pandas < 2.0" \
  --with "mokapot @ git+https://github.com/wfondrie/[email protected]" -p 3.9\
  mokapot --proteins $FASTA_FILE --dest_dir mokapot_38_release $PINFILE_IMPUTED
/usr/bin/time -l uv run --no-config --with "numpy < 2.0" --with "pandas < 2.0" \
  --with "mokapot @ git+https://github.com/wfondrie/mokapot.git@main" -p 3.10 \
  mokapot --proteins $FASTA_FILE --dest_dir mokapot_310_main $PINFILE_IMPUTED
/usr/bin/time -l uv run --no-config \
  --with "mokapot @ git+https://github.com/jspaezp/mokapot.git@feature/auto_pin_handling2" -p 3.12 \
  mokapot --proteins $FASTA_FILE --dest_dir mokapot_312_results $PINFILE_IMPUTED

tree mokapot_*

for x in peptide protein psm; do
  echo "===== $x ====="
  head -2 mokapot_*/*$x*
done

for x in peptide protein psm; do
  echo "===== $x ====="
  wc -l mokapot_*/*$x*
done

Changes in the output:

note:

  • mokapot_310_main = current main branch
  • mokapot_312_results = this PR
  • mokapot_38_release = current release

Original column names in the pin file

SpecId  Label   ScanNr  ExpMass CalcMass        FileName
retentiontime   ion_mobility    rank    z=2     z=3     z=4
z=5     z=6     z=other peptide_len     missed_cleavages
semi_enzymatic  isotope_error   ln(precursor_ppm)       fragment_ppm
ln(hyperscore)  ln(delta_next)  ln(delta_best)  aligned_rt
predicted_rt    sqrt(delta_rt_model)  predicted_mobility
sqrt(delta_mobility)    matched_peaks   longest_b       longest_y
longest_y_pct   ln(matched_intensity_pct)       scored_candidates
ln(-poisson)  posterior_error  Peptide Proteins

Several columns changed from release -> main and back again from
main -> this PR

===== peptide =====
==> mokapot_310_main/targets.peptides.csv <==
PSMId   peptide score   q-value posterior_error_prob    proteinIds
94362   LFGNMEGDC[+57.0215]PSDWK        4.422188798143788       5.3949072025716305e-05  1.3279592835191726e-13  ENSP00000353224;ENSP00000353224.4;ENSP00000376197;ENSP00000376197.3;ENSP00000390133;ENSP00000390133.1;GENSCAN00000015036;sp|P02786|TFR1_HUMAN;tr|G3V0E5|G3V0E5_HUMAN

==> mokapot_312_results/targets.peptides.tsv <==
SpecId  FileName        ScanNr  ExpMass peptide score   mokapot_qvalue  posterior_error_prob    proteinIds
94362   09272020_PladB-4hr-5_Slot1-67_1_2098.d  101458  1654.6742       LFGNMEGDC[+57.0215]PSDWK        4.552594172421302       5.3253807e-05   5.975924587263371e-14   ENSP00000353224;ENSP00000353224.4;ENSP00000376197;ENSP00000376197.3;ENSP00000390133;ENSP00000390133.1;GENSCAN00000015036;sp|P02786|TFR1_HUMAN;tr|G3V0E5|G3V0E5_HUMAN

==> mokapot_38_release/mokapot.peptides.txt <==
SpecId  Label   ScanNr  ExpMass CalcMass        FileName        Peptide mokapot score   mokapot q-value mokapot PEP     Proteins
94362   True    101458  1654.6742       1654.6757       09272020_PladB-4hr-5_Slot1-67_1_2098.d  LFGNMEGDC[+57.0215]PSDWK        5.660280381349133       5.621767483696874e-05   4.277013148574386e-13  ENSP00000353224;ENSP00000353224.4;ENSP00000376197;ENSP00000376197.3;ENSP00000390133;ENSP00000390133.1;GENSCAN00000015036;sp|P02786|TFR1_HUMAN;tr|G3V0E5|G3V0E5_HUMAN

Currently (this pr vs last release)

  1. label is not in the peptide out.
  2. the columns are not capitalized (even though they are in the input file)
  3. Proteins gets renamed to proteinIds
  4. mokapot score -> score
  5. mokapot q-value -> mokapot_qvalue
  6. mokapot PEP -> posterior_error_prob

===== protein =====
==> mokapot_310_main/targets.proteins.csv <==
mokapot protein group   best peptide    stripped sequence       score   q-value posterior_error_prob
"ENSP00000307423        r11092:T324S, GENSCAN00000036155, ENSP00000307423.2, sp|Q10469|MGAT2_HUMAN"     IFHAGDC[+57.0215]GMHHK  IFHAGDCGMHHK    4.034602465755392       0.001497006043791771   4.657116363360999e-08

==> mokapot_312_results/targets.proteins.tsv <==
mokapot protein group   best peptide    stripped sequence       score   mokapot_qvalue  posterior_error_prob
"ENSP00000307423        r11092:T324S, GENSCAN00000036155, ENSP00000307423.2, sp|Q10469|MGAT2_HUMAN"     IFHAGDC[+57.0215]GMHHK  IFHAGDCGMHHK    4.159152488171724       0.00147929    3.15034802432272e-08

==> mokapot_38_release/mokapot.proteins.txt <==
mokapot protein group   best peptide    stripped sequence       mokapot score   mokapot q-value mokapot PEP
"ENSP00000225388        r16415:H45P, GENSCAN00000003560, ENSP00000225388.3, sp|Q7Z417|NUFP2_HUMAN"      GADNDGSGSESGYTTPK       GADNDGSGSESGYTTPK       4.347667255848529       0.0015873015873015873  2.238033114813839e-08

Currently:

  1. mokapot q-value -> mokapot_qvalue
  2. mokapot PEP -> posterior_error_prob

===== psm =====
==> mokapot_310_main/targets.psms.csv <==
PSMId   peptide score   q-value posterior_error_prob    proteinIds
94362   LFGNMEGDC[+57.0215]PSDWK        4.422188798143788       3.0441400667768903e-05  3.6080280142001793e-13  ENSP00000353224;ENSP00000353224.4;ENSP00000376197;ENSP00000376197.3;ENSP00000390133;ENSP00000390133.1;GENSCAN00000015036;sp|P02786|TFR1_HUMAN;tr|G3V0E5|G3V0E5_HUMAN

==> mokapot_312_results/targets.psms.tsv <==
SpecId  FileName        ScanNr  ExpMass peptide score   mokapot_qvalue  posterior_error_prob    proteinIds
94362   09272020_PladB-4hr-5_Slot1-67_1_2098.d  101458  1654.6742       LFGNMEGDC[+57.0215]PSDWK        4.552594172421302       2.9895366e-05   1.6038471672165864e-13  ENSP00000353224;ENSP00000353224.4;ENSP00000376197;ENSP00000376197.3;ENSP00000390133;ENSP00000390133.1;GENSCAN00000015036;sp|P02786|TFR1_HUMAN;tr|G3V0E5|G3V0E5_HUMAN

==> mokapot_38_release/mokapot.psms.txt <==
SpecId  Label   ScanNr  ExpMass CalcMass        FileName        Peptide mokapot score   mokapot q-value mokapot PEP     Proteins
94362   True    101458  1654.6742       1654.6757       09272020_PladB-4hr-5_Slot1-67_1_2098.d  LFGNMEGDC[+57.0215]PSDWK        5.660280381349133       3.332444681418288e-05   5.3656986367161e-14    ENSP00000353224;ENSP00000353224.4;ENSP00000376197;ENSP00000376197.3;ENSP00000390133;ENSP00000390133.1;GENSCAN00000015036;sp|P02786|TFR1_HUMAN;tr|G3V0E5|G3V0E5_HUMAN

Current:

  1. CalcMass -> not anymore
  2. Label -> not anymore
  3. Peptide -> peptide
  4. mokapot score -> score
  5. mokapot q-value -> mokapot_qvalue
  6. mokapot PEP -> posterior_error_prob
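
For anyone with downstream scripts written against the release output, the renames listed above can be captured in a small mapping. This is a sketch built only from the headers pasted in this comment; `RENAMED`, `DROPPED` and `translate_header` are hypothetical names and not part of mokapot.

```python
# Column changes between the last release and this PR, taken from the
# PSM headers shown above. Not an official mokapot mapping.
RENAMED = {
    "Peptide": "peptide",
    "Proteins": "proteinIds",
    "mokapot score": "score",
    "mokapot q-value": "mokapot_qvalue",
    "mokapot PEP": "posterior_error_prob",
}
DROPPED = {"Label", "CalcMass"}


def translate_header(release_columns):
    """Map release column names to the names written by this PR."""
    return [RENAMED.get(col, col) for col in release_columns if col not in DROPPED]


release_header = [
    "SpecId", "Label", "ScanNr", "ExpMass", "CalcMass", "FileName",
    "Peptide", "mokapot score", "mokapot q-value", "mokapot PEP", "Proteins",
]
pr_header = [
    "SpecId", "FileName", "ScanNr", "ExpMass", "peptide", "score",
    "mokapot_qvalue", "posterior_error_prob", "proteinIds",
]

translated = translate_header(release_header)
```

The translated header contains the same column names as the PR output, but in a different order (FileName moves forward in the PR), so code that selects columns by position would still break.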

IMO: It feels inconsistent to have generated columns that have spaces and ones that don't;
same for the mokapot prefix on column names. I get the impetus to make stuff more SQL-friendly, but there is some consistency to work on here.

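Peptide and PSM line counts are the same across all three runs! wooo! Protein counts differ slightly (14469 on main, 14484 in this PR, 14492 in the release).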

===== peptide =====
   83260 mokapot_310_main/targets.peptides.csv
   83260 mokapot_312_results/targets.peptides.tsv
   83260 mokapot_38_release/mokapot.peptides.txt
  249780 total
===== protein =====
   14469 mokapot_310_main/targets.proteins.csv
   14484 mokapot_312_results/targets.proteins.tsv
   14492 mokapot_38_release/mokapot.proteins.txt
   43445 total
===== psm =====
  147855 mokapot_310_main/targets.psms.csv
  147855 mokapot_312_results/targets.psms.tsv
  147855 mokapot_38_release/mokapot.psms.txt
  443565 total
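
A quick way to check these counts across the three runs is to compare them per level. The sketch below just re-enters the `wc -l` numbers from above; `levels_with_differences` is a hypothetical helper.

```python
# Line counts copied from the wc -l output above.
counts = {
    "peptide": {
        "mokapot_310_main": 83260,
        "mokapot_312_results": 83260,
        "mokapot_38_release": 83260,
    },
    "protein": {
        "mokapot_310_main": 14469,
        "mokapot_312_results": 14484,
        "mokapot_38_release": 14492,
    },
    "psm": {
        "mokapot_310_main": 147855,
        "mokapot_312_results": 147855,
        "mokapot_38_release": 147855,
    },
}


def levels_with_differences(counts):
    """Return the levels whose line counts are not identical in every run."""
    return [
        level
        for level, per_run in counts.items()
        if len(set(per_run.values())) > 1
    ]


differing = levels_with_differences(counts)
```

Only the protein level is reported as differing; the peptide and PSM counts match exactly in all three runs.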

@jspaezp jspaezp changed the title [WIP] v0.11.0 RC v0.11.0 RC1 Feb 7, 2025
4 participants