Compatible 9-species data reference (#334)
* Compatible 9-species data reference

* Generate new screengrabs with rich-codex

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
bittremieux and github-actions[bot] authored May 15, 2024
1 parent 1fcec6a commit c6a455b
Showing 6 changed files with 304 additions and 343 deletions.
5 changes: 3 additions & 2 deletions docs/faq.md
@@ -88,7 +88,8 @@ This will need to be set with each new shell session, or you can add it to your

The [reported Casanovo results](https://doi.org/10.1101/2023.01.03.522621) were obtained by training on two different datasets: (i) a commonly used nine-species benchmark dataset, and (ii) a large-scale training dataset derived from the MassIVE Knowledge Base (MassIVE-KB).

-All data for the _nine-species benchmark_ is available as annotated MGF files [on MassIVE](https://doi.org/doi:10.25345/C52V2CK8J).
+All data for the _nine-species benchmark_ are available as annotated MGF files on MassIVE with [dataset identifier MSV000090982](https://doi.org/doi:10.25345/C52V2CK8J).
+Annotated MGF files that are directly compatible with Casanovo are available in the `/MSV000090982/updates/2024-05-14_woutb_71950b89/peak/9speciesbenchmark` FTP directory.
Using these data, Casanovo was trained in a cross-validated fashion, training on eight species and testing on the remaining species.

The _MassIVE-KB training data_ was derived from PSMs used to compile the MassIVE-KB v1 spectral library and consists of 30 million PSMs.
@@ -98,7 +99,7 @@ This will give you a zipped TSV file with the metadata and peptide identificatio
Using the filename (column "filename") you can then retrieve the corresponding peak files from the MassIVE FTP server and extract the desired spectra using their scan number (column "scan").

The _non-enzymatic dataset_, used to train a non-tryptic version of Casanovo, was created by selecting PSMs with a uniform distribution of amino acids at the C-terminal peptide positions from two datasets: MassIVE-KB and PROSPECT.
-Training, validation, and test splits for the non-enzymatic dataset are available as annotated MGF files [on MassIVE](https://doi.org/doi:10.25345/C5KS6JG0W).
+Training, validation, and test splits for the non-enzymatic dataset are available as annotated MGF files on MassIVE with [dataset identifier MSV000094014](https://doi.org/doi:10.25345/C5KS6JG0W).

**How do I know which model to use after training Casanovo?**

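Note on the MassIVE-KB lookup described in the context lines of this hunk: the FAQ says to use the "filename" column to find the peak files and the "scan" column to pick out spectra. A minimal sketch of that first step, grouping the wanted scan numbers per peak file, might look like the following. The column names `filename` and `scan` come from the FAQ text; the `annotation` column, the file names, and the helper `scans_by_file` are hypothetical and only serve the illustration. It also assumes scan numbers are plain integers.

```python
import csv
import io
from collections import defaultdict


def scans_by_file(tsv_text):
    """Group scan numbers by peak file name from a tab-separated metadata table.

    Assumes the table has "filename" and "scan" columns, as named in the FAQ.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    grouped = defaultdict(set)
    for row in reader:
        grouped[row["filename"]].add(int(row["scan"]))
    # Sort so each file's scans can be extracted in a single ordered pass.
    return {name: sorted(scans) for name, scans in grouped.items()}


# Hypothetical rows for illustration only; real file names and scans differ.
example_tsv = (
    "filename\tscan\tannotation\n"
    "run_a.mzML\t120\tPEPTIDEK\n"
    "run_b.mzML\t7\tSAMPLER\n"
    "run_a.mzML\t15\tELVISK\n"
)

wanted = scans_by_file(example_tsv)
for name, scans in wanted.items():
    print(name, scans)
```

With a mapping like this, each peak file only needs to be downloaded from the MassIVE FTP server once, after which the listed scans can be extracted with whatever mass spectrometry reader is already in use.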
64 changes: 29 additions & 35 deletions docs/images/configure-help.svg