Eliminate evaluate Command (#359)
* prediction output in model eval mode

* eliminate eval command, introduce -e flag for predict command

* adapted unit test to new model runner and model functionality

* updated documentation

* removed log and result files

* Generate new screengrabs with rich-codex

* Update paper reference (#361)

* Bug report template (#360)

* bug report template

* punctuation, hardware description item

* Restrict NumPy to pre-2.0 (#344)

* Restrict NumPy to pre-2.0

* Update changelog

* Update paper reference (#361)

---------

Co-authored-by: Lilferrit <[email protected]>

* upgrade codecov to v4 (#364)

* implement eval mode at model runner level, fix unit test

* CLI documentation

* Generate new screengrabs with rich-codex

* requested changes

* Generate new screengrabs with rich-codex

* evaluation test cases

* file warnings, evaluation tests

* fixed ubuntu specific test case bug

* verify annotated mgf files

* verify annotated mgf files

* Generate new screengrabs with rich-codex

* Save best model (#365)

* save best model

* save best model

* updated unit tests

* remove save top k config item

* added save_top_k to deprecated config options

* changelog entry

* test case, formatting

* requested changes


* AnnotatedSpectrumIndex type error

* requested changes, changelog entry

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Wout Bittremieux <[email protected]>
3 people authored Aug 21, 2024
1 parent ba58668 commit 67939b8
Showing 11 changed files with 504 additions and 347 deletions.
55 changes: 55 additions & 0 deletions .github/ISSUE_TEMPLATE/bug_report_template.md
@@ -0,0 +1,55 @@
---
name: Bug Report
about: Submit a Casanovo Bug Report
labels: bug
---

## Describe the Issue
A clear and concise description of what the issue/bug is.

## Steps To Reproduce
Steps to reproduce the incorrect behavior.

## Expected Behavior
A clear and concise description of what you expected to happen.

## Terminal Output (If Applicable)
Provide any applicable console output in between the tick marks below.

```
```

## Environment:
- OS: [e.g. Windows 11, Windows 10, macOS 14, Ubuntu 24.04]
- Casanovo Version: [e.g. 4.2.1]
- Hardware Used (CPU or GPU, if GPU also GPU model and CUDA version): [e.g. GPU: NVIDIA GeForce RTX 2070, CUDA Version: 12.5]

### Checking GPU Version

The GPU model can be checked by typing `nvidia-smi` into a terminal/console window.
An example of how to use this command is shown below.
In this case, the CUDA version is 12.5 and the GPU model is GeForce RTX 2070.


```
(casanovo_env) C:\Users\<user>\OneDrive\Documents\casanovo>nvidia-smi
Fri Aug 2 12:34:57 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.99 Driver Version: 555.99 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Driver-Model | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 2070 ... WDDM | 00000000:01:00.0 On | N/A |
| N/A 60C P8 16W / 90W | 1059MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
```

## Additional Context
Add any other context about the problem here.

## Attach Files
Please attach all input files used and the full Casanovo log file.
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -11,6 +11,10 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
- During training, model checkpoints will be saved at the end of each training epoch in addition to the checkpoints saved at the end of every validation run.
- Besides as a local file, model weights can be specified from a URL. Upon initial download, the weights file is cached for future re-use.

### Changed

- Removed the `evaluate` sub-command; all model evaluation functionality has been moved to the `sequence` command via the new `--evaluate` flag.

### Fixed

- Precursor charges are exported as integers instead of floats in the mzTab output file, in compliance with the mzTab specification.
56 changes: 21 additions & 35 deletions casanovo/casanovo.py
@@ -128,64 +128,50 @@ def main() -> None:
nargs=-1,
type=click.Path(exists=True, dir_okay=False),
)
@click.option(
"--evaluate",
"-e",
is_flag=True,
default=False,
help="""
Run in evaluation mode. When this flag is set the peptide and amino
acid precision will be calculated and logged at the end of the sequencing
run. All input files must be annotated MGF files if running in evaluation
mode.
""",
)
def sequence(
peak_path: Tuple[str],
model: Optional[str],
config: Optional[str],
output: Optional[str],
verbosity: str,
evaluate: bool,
) -> None:
"""De novo sequence peptides from tandem mass spectra.
PEAK_PATH must be one or more mzMl, mzXML, or MGF files from which
to sequence peptides.
PEAK_PATH must be one or more mzML, mzXML, or MGF files from which
to sequence peptides. If the evaluate flag is set, PEAK_PATH must
be one or more annotated MGF files.
"""
output = setup_logging(output, verbosity)
config, model = setup_model(model, config, output, False)
start_time = time.time()
with ModelRunner(config, model) as runner:
logger.info("Sequencing peptides from:")
logger.info(
"Sequencing %speptides from:",
"and evaluating " if evaluate else "",
)
for peak_file in peak_path:
logger.info(" %s", peak_file)

runner.predict(peak_path, output)
runner.predict(peak_path, output, evaluate=evaluate)
psms = runner.writer.psms
utils.log_sequencing_report(
psms, start_time=start_time, end_time=time.time()
)


@main.command(cls=_SharedParams)
@click.argument(
"annotated_peak_path",
required=True,
nargs=-1,
type=click.Path(exists=True, dir_okay=False),
)
def evaluate(
annotated_peak_path: Tuple[str],
model: Optional[str],
config: Optional[str],
output: Optional[str],
verbosity: str,
) -> None:
"""Evaluate de novo peptide sequencing performance.
ANNOTATED_PEAK_PATH must be one or more annoated MGF files,
such as those provided by MassIVE-KB.
"""
output = setup_logging(output, verbosity)
config, model = setup_model(model, config, output, False)
start_time = time.time()
with ModelRunner(config, model) as runner:
logger.info("Sequencing and evaluating peptides from:")
for peak_file in annotated_peak_path:
logger.info(" %s", peak_file)

runner.evaluate(annotated_peak_path)
utils.log_run_report(start_time=start_time, end_time=time.time())


@main.command(cls=_SharedParams)
@click.argument(
"train_peak_path",
4 changes: 3 additions & 1 deletion casanovo/data/datasets.py
@@ -83,7 +83,9 @@ def __getitem__(
The unique spectrum identifier, formed by its original peak file and
identifier (index or scan number) therein.
"""
mz_array, int_array, precursor_mz, precursor_charge = self.index[idx]
mz_array, int_array, precursor_mz, precursor_charge = self.index[idx][
:4
]
spectrum = self._process_peaks(
mz_array, int_array, precursor_mz, precursor_charge
)
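The `[:4]` slice introduced above exists because annotated index entries append the peptide annotation as an extra field, which plain spectrum processing must ignore. A minimal sketch (the entry values below are made up for illustration):

```python
# Why __getitem__ slices with [:4]: an annotated index entry carries a
# fifth element (the peptide annotation) beyond the four spectrum fields.
# The values here are illustrative, not real data.
annotated_entry = ([101.1, 202.2], [1.0, 0.5], 450.7, 2, "PEPTIDE")
mz_array, int_array, precursor_mz, precursor_charge = annotated_entry[:4]
# The trailing annotation ("PEPTIDE") is simply dropped.
```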
82 changes: 64 additions & 18 deletions casanovo/denovo/model_runner.py
@@ -10,6 +10,7 @@
from pathlib import Path
from typing import Iterable, List, Optional, Union

import depthcharge.masses
import lightning.pytorch as pl
import numpy as np
import torch
@@ -20,6 +21,7 @@
from ..config import Config
from ..data import ms_io
from ..denovo.dataloaders import DeNovoDataModule
from ..denovo.evaluate import aa_match_batch, aa_match_metrics
from ..denovo.model import Spec2Pep


@@ -118,36 +120,52 @@ def train(
self.loaders.val_dataloader(),
)

def evaluate(self, peak_path: Iterable[str]) -> None:
"""Evaluate peptide sequence preditions from a trained Casanovo model.
def log_metrics(self, test_index: AnnotatedSpectrumIndex) -> None:
"""Log peptide precision and amino acid precision.
Calculate and log peptide precision and amino acid precision
based on model predictions and spectrum annotations.
Parameters
----------
peak_path : iterable of str
The path with MS data files for predicting peptide sequences.
Returns
-------
self
test_index : AnnotatedSpectrumIndex
Index containing the annotated spectra used to generate model
predictions
"""
self.initialize_trainer(train=False)
self.initialize_model(train=False)

test_index = self._get_index(peak_path, True, "evaluation")
self.initialize_data_module(test_index=test_index)
self.loaders.setup(stage="test", annotated=True)
model_output = [psm[0] for psm in self.writer.psms]
spectrum_annotations = [
test_index[i][4] for i in range(test_index.n_spectra)
]
aa_precision, _, pep_precision = aa_match_metrics(
*aa_match_batch(
spectrum_annotations,
model_output,
depthcharge.masses.PeptideMass().masses,
)
)

self.trainer.validate(self.model, self.loaders.test_dataloader())
logger.info("Peptide Precision: %.2f%%", 100 * pep_precision)
logger.info("Amino Acid Precision: %.2f%%", 100 * aa_precision)
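The metric bookkeeping in `log_metrics` can be sketched as follows. Note this is illustrative only: exact string matching stands in for the mass-based residue matching that `aa_match_batch` actually performs, and `precision_sketch` is a hypothetical helper, not part of Casanovo.

```python
def precision_sketch(predictions, annotations):
    """Return (aa_precision, pep_precision) for paired peptide strings.

    Simplified stand-in for aa_match_batch/aa_match_metrics: residues are
    compared by exact character equality rather than by mass.
    """
    aa_correct = aa_total = pep_correct = 0
    for pred, true in zip(predictions, annotations):
        aa_total += len(pred)
        # Count positions where the predicted residue matches the annotation.
        aa_correct += sum(p == t for p, t in zip(pred, true))
        pep_correct += pred == true
    aa_precision = aa_correct / aa_total if aa_total else 0.0
    pep_precision = pep_correct / len(predictions) if predictions else 0.0
    return aa_precision, pep_precision

aa_prec, pep_prec = precision_sketch(["PEPTIDE", "LESLIE"], ["PEPTIDE", "LESLIK"])
print(f"AA precision: {aa_prec:.2%}, peptide precision: {pep_prec:.2%}")
```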

def predict(self, peak_path: Iterable[str], output: str) -> None:
def predict(
self, peak_path: Iterable[str], output: str, evaluate: bool = False
) -> None:
"""Predict peptide sequences with a trained Casanovo model.
Can also evaluate the model during prediction if provided with
annotated peak files.
Parameters
----------
peak_path : iterable of str
The path with the MS data files for predicting peptide sequences.
output : str
Where should the output be saved?
evaluate: bool
Whether to run model evaluation in addition to inference.
Note: peak_path must point to annotated MS data files when
running model evaluation. Files that are not in an annotated
peak file format will be ignored if evaluate is set to true.
Returns
-------
@@ -164,12 +182,15 @@ def predict(self, peak_path: Iterable[str], output: str) -> None:
self.initialize_model(train=False)
self.model.out_writer = self.writer

test_index = self._get_index(peak_path, False, "")
test_index = self._get_index(peak_path, evaluate, "")
self.writer.set_ms_run(test_index.ms_files)
self.initialize_data_module(test_index=test_index)
self.loaders.setup(stage="test", annotated=False)
self.trainer.predict(self.model, self.loaders.test_dataloader())

if evaluate:
self.log_metrics(test_index)

def initialize_trainer(self, train: bool) -> None:
"""Initialize the lightning Trainer.
@@ -398,7 +419,22 @@ def _get_index(

Index = AnnotatedSpectrumIndex if annotated else SpectrumIndex
valid_charge = np.arange(1, self.config.max_charge + 1)
return Index(index_fname, filenames, valid_charge=valid_charge)

try:
return Index(index_fname, filenames, valid_charge=valid_charge)
except TypeError as e:
if Index == AnnotatedSpectrumIndex:
error_msg = (
"Error creating annotated spectrum index. "
"This may be the result of having an unannotated MGF file "
"present in the validation peak file path list.\n"
f"Original error message: {e}"
)

logger.error(error_msg)
raise TypeError(error_msg)

raise e
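The wrap-and-re-raise pattern in `_get_index` can be sketched in isolation. Here `build_index` and `factory` are hypothetical stand-ins for the method and the `AnnotatedSpectrumIndex`/`SpectrumIndex` constructors:

```python
def build_index(factory, fname, annotated):
    """Sketch of the error handling in _get_index(): when building an
    annotated index fails with a TypeError, re-raise with a hint that an
    unannotated MGF file may be present in the peak file list."""
    try:
        return factory(fname)
    except TypeError as e:
        if annotated:
            msg = (
                "Error creating annotated spectrum index. This may be the "
                "result of an unannotated MGF file in the peak file list. "
                f"Original error message: {e}"
            )
            raise TypeError(msg) from e
        # Non-annotated indexes propagate the original error unchanged.
        raise
```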

def _get_strategy(self) -> Union[str, DDPStrategy]:
"""Get the strategy for the Trainer.
@@ -451,5 +487,15 @@ def _get_peak_filenames(
for fname in glob.glob(path, recursive=True):
if Path(fname).suffix.lower() in supported_ext:
found_files.add(fname)
else:
warnings.warn(
f"Ignoring unsupported peak file: {fname}", RuntimeWarning
)

if len(found_files) == 0:
warnings.warn(
f"No supported peak files found under path(s): {list(paths)}",
RuntimeWarning,
)

return sorted(list(found_files))
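The new warning behavior in `_get_peak_filenames` can be sketched as below; the messages mirror the diff, while the helper name `filter_peak_files` and the exact extension set are assumptions for the example:

```python
import warnings
from pathlib import Path

def filter_peak_files(filenames, supported_ext=(".mgf", ".mzml", ".mzxml")):
    """Sketch of _get_peak_filenames()'s new warnings: keep supported
    peak files, warn once per unsupported file, and warn again when
    nothing usable is found."""
    found_files = set()
    for fname in filenames:
        if Path(fname).suffix.lower() in supported_ext:
            found_files.add(fname)
        else:
            warnings.warn(
                f"Ignoring unsupported peak file: {fname}", RuntimeWarning
            )
    if not found_files:
        warnings.warn(
            f"No supported peak files found under path(s): {list(filenames)}",
            RuntimeWarning,
        )
    return sorted(found_files)
```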
