Eliminate evaluate Command (#359)
* prediction output in model eval mode

* eliminate eval command, introduce -e flag for predict command

* adapted unit test to new model runner and model functionality

* updated documentation

* removed log and result files

* Generate new screengrabs with rich-codex

* Update paper reference (#361)

* Bug report template (#360)

* bug report template

* punctuation, hardware description item

* Restrict NumPy to pre-2.0 (#344)

* Restrict NumPy to pre-2.0

* Update changelog

* Update paper reference (#361)

---------

Co-authored-by: Lilferrit <[email protected]>

* upgrade codecov to v4 (#364)

* implement eval mode at model runner level, fix unit test

* CLI documentation

* Generate new screengrabs with rich-codex

* requested changes

* Generate new screengrabs with rich-codex

* evaluation test cases

* file warnings, evaluation tests

* fixed ubuntu specific test case bug

* verify annotated mgf files

* verify annotated mgf files

* Generate new screengrabs with rich-codex

* Save best model (#365)

* save best model

* save best model

* updated unit tests

* remove save top k config item

* added save_top_k to deprecated config options

* changelog entry

* test case, formatting

* requested changes


* AnnotatedSpectrumIndex type error

* requested changes, changelog entry

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Wout Bittremieux <[email protected]>
3 people authored Aug 21, 2024
1 parent ba58668 commit 67939b8
Showing 11 changed files with 504 additions and 347 deletions.
55 changes: 55 additions & 0 deletions .github/ISSUE_TEMPLATE/bug_report_template.md
@@ -0,0 +1,55 @@
---
name: Bug Report
about: Submit a Casanovo Bug Report
labels: bug
---

## Describe the Issue
A clear and concise description of what the issue/bug is.

## Steps To Reproduce
Steps to reproduce the incorrect behavior.

## Expected Behavior
A clear and concise description of what you expected to happen.

## Terminal Output (If Applicable)
Provide any applicable console output in between the tick marks below.

```
```

## Environment:
- OS: [e.g. Windows 11, Windows 10, macOS 14, Ubuntu 24.04]
- Casanovo Version: [e.g. 4.2.1]
- Hardware Used (CPU or GPU, if GPU also GPU model and CUDA version): [e.g. GPU: NVIDIA GeForce RTX 2070, CUDA Version: 12.5]

### Checking GPU Version

The GPU model can be checked by typing `nvidia-smi` into a terminal/console window.
An example of how to use this command is shown below.
In this case, the CUDA version is 12.5 and the GPU model is GeForce RTX 2070.


```
(casanovo_env) C:\Users\<user>\OneDrive\Documents\casanovo>nvidia-smi
Fri Aug 2 12:34:57 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.99 Driver Version: 555.99 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Driver-Model | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 2070 ... WDDM | 00000000:01:00.0 On | N/A |
| N/A 60C P8 16W / 90W | 1059MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
```

## Additional Context
Add any other context about the problem here.

## Attach Files
Please attach all input files used and the full Casanovo log file.
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -11,6 +11,10 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
- During training, model checkpoints will be saved at the end of each training epoch in addition to the checkpoints saved at the end of every validation run.
- Besides as a local file, model weights can be specified from a URL. Upon initial download, the weights file is cached for future re-use.

### Changed

- Removed the `evaluate` sub-command; all model evaluation functionality has been moved to the `sequence` command via the new `--evaluate` flag.

### Fixed

- Precursor charges are exported as integers instead of floats in the mzTab output file, in compliance with the mzTab specification.
56 changes: 21 additions & 35 deletions casanovo/casanovo.py
@@ -128,64 +128,50 @@ def main() -> None:
nargs=-1,
type=click.Path(exists=True, dir_okay=False),
)
@click.option(
"--evaluate",
"-e",
is_flag=True,
default=False,
help="""
Run in evaluation mode. When this flag is set the peptide and amino
acid precision will be calculated and logged at the end of the sequencing
run. All input files must be annotated MGF files if running in evaluation
mode.
""",
)
def sequence(
peak_path: Tuple[str],
model: Optional[str],
config: Optional[str],
output: Optional[str],
verbosity: str,
evaluate: bool,
) -> None:
"""De novo sequence peptides from tandem mass spectra.
PEAK_PATH must be one or more mzMl, mzXML, or MGF files from which
to sequence peptides.
PEAK_PATH must be one or more mzML, mzXML, or MGF files from which
to sequence peptides. If the evaluate flag is set, PEAK_PATH must
be one or more annotated MGF files.
"""
output = setup_logging(output, verbosity)
config, model = setup_model(model, config, output, False)
start_time = time.time()
with ModelRunner(config, model) as runner:
logger.info("Sequencing peptides from:")
logger.info(
"Sequencing %speptides from:",
"and evaluating " if evaluate else "",
)
for peak_file in peak_path:
logger.info(" %s", peak_file)

runner.predict(peak_path, output)
runner.predict(peak_path, output, evaluate=evaluate)
psms = runner.writer.psms
utils.log_sequencing_report(
psms, start_time=start_time, end_time=time.time()
)


@main.command(cls=_SharedParams)
@click.argument(
"annotated_peak_path",
required=True,
nargs=-1,
type=click.Path(exists=True, dir_okay=False),
)
def evaluate(
annotated_peak_path: Tuple[str],
model: Optional[str],
config: Optional[str],
output: Optional[str],
verbosity: str,
) -> None:
"""Evaluate de novo peptide sequencing performance.
ANNOTATED_PEAK_PATH must be one or more annoated MGF files,
such as those provided by MassIVE-KB.
"""
output = setup_logging(output, verbosity)
config, model = setup_model(model, config, output, False)
start_time = time.time()
with ModelRunner(config, model) as runner:
logger.info("Sequencing and evaluating peptides from:")
for peak_file in annotated_peak_path:
logger.info(" %s", peak_file)

runner.evaluate(annotated_peak_path)
utils.log_run_report(start_time=start_time, end_time=time.time())


@main.command(cls=_SharedParams)
@click.argument(
"train_peak_path",
4 changes: 3 additions & 1 deletion casanovo/data/datasets.py
@@ -83,7 +83,9 @@ def __getitem__(
The unique spectrum identifier, formed by its original peak file and
identifier (index or scan number) therein.
"""
mz_array, int_array, precursor_mz, precursor_charge = self.index[idx]
mz_array, int_array, precursor_mz, precursor_charge = self.index[idx][
:4
]
spectrum = self._process_peaks(
mz_array, int_array, precursor_mz, precursor_charge
)
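The `[:4]` slice introduced above exists because annotated index entries append the peptide annotation as an extra field, which plain spectrum processing must ignore. A minimal sketch (the entry values below are made up for illustration):

```python
# Why __getitem__ slices with [:4]: an annotated index entry carries a
# fifth element (the peptide annotation) beyond the four spectrum fields.
# The values here are illustrative, not real data.
annotated_entry = ([101.1, 202.2], [1.0, 0.5], 450.7, 2, "PEPTIDE")
mz_array, int_array, precursor_mz, precursor_charge = annotated_entry[:4]
# The trailing annotation ("PEPTIDE") is simply dropped.
```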
82 changes: 64 additions & 18 deletions casanovo/denovo/model_runner.py
@@ -10,6 +10,7 @@
from pathlib import Path
from typing import Iterable, List, Optional, Union

import depthcharge.masses
import lightning.pytorch as pl
import numpy as np
import torch
@@ -20,6 +21,7 @@
from ..config import Config
from ..data import ms_io
from ..denovo.dataloaders import DeNovoDataModule
from ..denovo.evaluate import aa_match_batch, aa_match_metrics
from ..denovo.model import Spec2Pep


@@ -118,36 +120,52 @@ def train(
self.loaders.val_dataloader(),
)

def evaluate(self, peak_path: Iterable[str]) -> None:
"""Evaluate peptide sequence preditions from a trained Casanovo model.
def log_metrics(self, test_index: AnnotatedSpectrumIndex) -> None:
"""Log peptide precision and amino acid precision.
Calculate and log peptide precision and amino acid precision
based on model predictions and spectrum annotations.
Parameters
----------
peak_path : iterable of str
The path with MS data files for predicting peptide sequences.
Returns
-------
self
test_index : AnnotatedSpectrumIndex
Index containing the annotated spectra used to generate model
predictions
"""
self.initialize_trainer(train=False)
self.initialize_model(train=False)

test_index = self._get_index(peak_path, True, "evaluation")
self.initialize_data_module(test_index=test_index)
self.loaders.setup(stage="test", annotated=True)
model_output = [psm[0] for psm in self.writer.psms]
spectrum_annotations = [
test_index[i][4] for i in range(test_index.n_spectra)
]
aa_precision, _, pep_precision = aa_match_metrics(
*aa_match_batch(
spectrum_annotations,
model_output,
depthcharge.masses.PeptideMass().masses,
)
)

self.trainer.validate(self.model, self.loaders.test_dataloader())
logger.info("Peptide Precision: %.2f%%", 100 * pep_precision)
logger.info("Amino Acid Precision: %.2f%%", 100 * aa_precision)
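The metric bookkeeping in `log_metrics` can be sketched as follows. Note this is illustrative only: exact string matching stands in for the mass-based residue matching that `aa_match_batch` actually performs, and `precision_sketch` is a hypothetical helper, not part of Casanovo.

```python
def precision_sketch(predictions, annotations):
    """Return (aa_precision, pep_precision) for paired peptide strings.

    Simplified stand-in for aa_match_batch/aa_match_metrics: residues are
    compared by exact character equality rather than by mass.
    """
    aa_correct = aa_total = pep_correct = 0
    for pred, true in zip(predictions, annotations):
        aa_total += len(pred)
        # Count positions where the predicted residue matches the annotation.
        aa_correct += sum(p == t for p, t in zip(pred, true))
        pep_correct += pred == true
    aa_precision = aa_correct / aa_total if aa_total else 0.0
    pep_precision = pep_correct / len(predictions) if predictions else 0.0
    return aa_precision, pep_precision

aa_prec, pep_prec = precision_sketch(["PEPTIDE", "LESLIE"], ["PEPTIDE", "LESLIK"])
print(f"AA precision: {aa_prec:.2%}, peptide precision: {pep_prec:.2%}")
```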

def predict(self, peak_path: Iterable[str], output: str) -> None:
def predict(
self, peak_path: Iterable[str], output: str, evaluate: bool = False
) -> None:
"""Predict peptide sequences with a trained Casanovo model.
Can also evaluate the model during prediction if provided with
annotated peak files.
Parameters
----------
peak_path : iterable of str
The path with the MS data files for predicting peptide sequences.
output : str
Where should the output be saved?
evaluate: bool
Whether to run model evaluation in addition to inference.
Note: peak_path must point to annotated MS data files when
running model evaluation. Files that are not in an annotated
peak file format will be ignored if evaluate is set to true.
Returns
-------
@@ -164,12 +182,15 @@ def predict(self, peak_path: Iterable[str], output: str) -> None:
self.initialize_model(train=False)
self.model.out_writer = self.writer

test_index = self._get_index(peak_path, False, "")
test_index = self._get_index(peak_path, evaluate, "")
self.writer.set_ms_run(test_index.ms_files)
self.initialize_data_module(test_index=test_index)
self.loaders.setup(stage="test", annotated=False)
self.trainer.predict(self.model, self.loaders.test_dataloader())

if evaluate:
self.log_metrics(test_index)

def initialize_trainer(self, train: bool) -> None:
"""Initialize the lightning Trainer.
@@ -398,7 +419,22 @@ def _get_index(

Index = AnnotatedSpectrumIndex if annotated else SpectrumIndex
valid_charge = np.arange(1, self.config.max_charge + 1)
return Index(index_fname, filenames, valid_charge=valid_charge)

try:
return Index(index_fname, filenames, valid_charge=valid_charge)
except TypeError as e:
if Index == AnnotatedSpectrumIndex:
error_msg = (
"Error creating annotated spectrum index. "
"This may be the result of having an unannotated MGF file "
"present in the validation peak file path list.\n"
f"Original error message: {e}"
)

logger.error(error_msg)
raise TypeError(error_msg)

raise e
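The wrap-and-re-raise pattern in `_get_index` can be sketched in isolation. Here `build_index` and `factory` are hypothetical stand-ins for the method and the `AnnotatedSpectrumIndex`/`SpectrumIndex` constructors:

```python
def build_index(factory, fname, annotated):
    """Sketch of the error handling in _get_index(): when building an
    annotated index fails with a TypeError, re-raise with a hint that an
    unannotated MGF file may be present in the peak file list."""
    try:
        return factory(fname)
    except TypeError as e:
        if annotated:
            msg = (
                "Error creating annotated spectrum index. This may be the "
                "result of an unannotated MGF file in the peak file list. "
                f"Original error message: {e}"
            )
            raise TypeError(msg) from e
        # Non-annotated indexes propagate the original error unchanged.
        raise
```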

def _get_strategy(self) -> Union[str, DDPStrategy]:
"""Get the strategy for the Trainer.
@@ -451,5 +487,15 @@ def _get_peak_filenames(
for fname in glob.glob(path, recursive=True):
if Path(fname).suffix.lower() in supported_ext:
found_files.add(fname)
else:
warnings.warn(
f"Ignoring unsupported peak file: {fname}", RuntimeWarning
)

if len(found_files) == 0:
warnings.warn(
f"No supported peak files found under path(s): {list(paths)}",
RuntimeWarning,
)

return sorted(list(found_files))
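The new warning behavior in `_get_peak_filenames` can be sketched as below; the messages mirror the diff, while the helper name `filter_peak_files` and the exact extension set are assumptions for the example:

```python
import warnings
from pathlib import Path

def filter_peak_files(filenames, supported_ext=(".mgf", ".mzml", ".mzxml")):
    """Sketch of _get_peak_filenames()'s new warnings: keep supported
    peak files, warn once per unsupported file, and warn again when
    nothing usable is found."""
    found_files = set()
    for fname in filenames:
        if Path(fname).suffix.lower() in supported_ext:
            found_files.add(fname)
        else:
            warnings.warn(
                f"Ignoring unsupported peak file: {fname}", RuntimeWarning
            )
    if not found_files:
        warnings.warn(
            f"No supported peak files found under path(s): {list(filenames)}",
            RuntimeWarning,
        )
    return sorted(found_files)
```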
