Prepare release v4.2.0 (#331)
* Remove `train_from_scratch` config option (#275)

Instead of having to specify `train_from_scratch` in the config file, training will proceed from an existing model weights file if this is given as an argument to `casanovo train`.

Fixes #263.

* Stabilize torch.topk() behavior (#290)

* Add epsilon to index zero

* Fix typo

* Use base PyTorch for repeating along the vocabulary size

* Combine masking steps

* Lint with updated black version

* Lint test files

* Add topk unit test

* Fix lint

* Add fixme comment for future

* Update changelog

* Generate new screengrabs with rich-codex

---------

Co-authored-by: Wout Bittremieux <[email protected]>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
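
To make the torch.topk() stabilization above concrete: `torch.topk()` gives no guarantee about which index wins when several entries share exactly the same score, so tied scores can produce different results across runs and devices. A minimal sketch of the epsilon idea named in the bullets — not the actual Casanovo code, and assuming scores of roughly unit magnitude so the epsilon is not absorbed:

```python
import torch

def topk_prefer_first(scores: torch.Tensor, k: int):
    """Top-k values/indices that deterministically prefer index 0 on ties.

    Nudging index 0 by the smallest representable step makes ties resolve the
    same way on every device, without reordering scores whose gaps are larger
    than the epsilon.
    """
    nudged = scores.clone()
    nudged[..., 0] += torch.finfo(scores.dtype).eps
    return torch.topk(nudged, k, dim=-1)

# Both candidates score 0.5; index 0 now wins the tie reproducibly.
values, indices = topk_prefer_first(torch.tensor([0.5, 0.5, 0.1]), k=1)
```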

* Rename max_iters to cosine_schedule_period_iters (#300)

* Rename max_iters to cosine_schedule_period_iters

* Add deprecated config option unit test

* Fix missed rename

* Proper linting

* Remove unnecessary logging

* Test that checkpoints with deprecated config options can be loaded

* Minor change

* Add test for fine-tuning with deprecated config options

* Remove deprecated hyperparameters during model loading

* Include deprecated hyperparameter warning

* Test whether the warning is issued

* Verify that the deprecated option is removed

* Fix comments

* Avoid defining deprecated options twice

* Remap previously renamed config option `every_n_train_steps`

* Update changelog

---------

Co-authored-by: melihyilmaz <[email protected]>
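
In practice, the deprecation handling added in this PR means an old-style config file still loads, with deprecated keys moved to their new names and a warning issued. A standalone sketch of that remapping step, using the same names that appear in the `casanovo/config.py` diff further down this page:

```python
import warnings

# Mapping of deprecated option names to their replacements (as in the diff below).
_config_deprecated = dict(
    every_n_train_steps="val_check_interval",
    max_iters="cosine_schedule_period_iters",
)

def remap_deprecated(user_config: dict) -> dict:
    """Move deprecated keys to their new names and warn about each remap."""
    for old, new in _config_deprecated.items():
        if old in user_config:
            user_config[new] = user_config.pop(old)
            warnings.warn(
                f"Deprecated config option '{old}' remapped to '{new}'",
                DeprecationWarning,
            )
    return user_config

# An old-style config keeps working, now under the new key.
cfg = remap_deprecated({"max_iters": 600_000})
assert cfg == {"cosine_schedule_period_iters": 600_000}
```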

* Add FAQ entry about antibody sequencing

* Don't crash when multiple beams have identical peptide scores (#306)

* Test different beams with identical scores

* Randomly break ties for beams with identical peptide score

* Update changelog

* Don't remove unit test
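
The failure fixed here occurs when finished beams are cached by peptide score and two entries tie exactly, so the comparison falls through to payloads that may not be comparable. A hedged sketch of the random tie-breaking idea — the heap-based cache and names below are illustrative assumptions, not the actual Casanovo beam-search code:

```python
import heapq
import random

def cache_prediction(cache: list, score: float, peptide, top_match: int = 5) -> None:
    """Keep the `top_match` best-scoring predictions, breaking score ties randomly.

    The random number sits between the score and the payload in the heap entry,
    so heapq never has to compare two payloads when scores are identical.
    """
    entry = (score, random.random(), peptide)
    if len(cache) < top_match:
        heapq.heappush(cache, entry)
    else:
        heapq.heappushpop(cache, entry)

# Two beams with identical peptide scores no longer trip up the comparison.
cache: list = []
cache_prediction(cache, 0.87, {"seq": "PEPTIDEK"})
cache_prediction(cache, 0.87, {"seq": "PEPTLDEK"})
```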

* Allow csv to handle all newlines (#316)

* Add 9-species model weights link to FAQ (#303)

* Add model weights link

* Generate new screengrabs with rich-codex

* Clarify that these weights should only be used for benchmarking

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Wout Bittremieux <[email protected]>

* Add FAQ entry about antibody sequencing (#304)

* Add FAQ entry about antibody sequencing

* Generate new screengrabs with rich-codex

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Melih Yilmaz <[email protected]>

* Allow csv to handle all newlines

The `csv` module handles line endings itself. On Windows, the default text-mode file translates the trailing `\n` written by the csv writer into `\r\n` again, which leads to line endings of `\r\r\n` instead of `\r\n`.

Setting `newline=''` produces the intended output on both platforms.
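
A minimal sketch of the resulting call (the file name and row contents are illustrative), mirroring the one-line change to `casanovo/data/ms_io.py` shown further down this page:

```python
import csv
import os

# With newline="" the writer's explicit os.linesep is written verbatim; without
# it, Windows text mode would translate the trailing "\n" again, giving "\r\r\n".
with open("results.mztab", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t", lineterminator=os.linesep)
    writer.writerow(["PSH", "sequence", "search_engine_score[1]"])
```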

* Update CHANGELOG.md

* Fix linting issue

* Delete docs/images/help.svg

---------

Co-authored-by: Melih Yilmaz <[email protected]>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Wout Bittremieux <[email protected]>
Co-authored-by: William Stafford Noble <[email protected]>
Co-authored-by: Wout Bittremieux <[email protected]>

* Don't test on macOS versions with MPS (#327)

* Prepare for release v4.2.0

* Update CHANGELOG.md (#332)

---------

Co-authored-by: Melih Yilmaz <[email protected]>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: melihyilmaz <[email protected]>
Co-authored-by: wsnoble <[email protected]>
Co-authored-by: Joshua Klein <[email protected]>
6 people authored May 14, 2024
1 parent 6dc301c commit 1fcec6a
Showing 11 changed files with 231 additions and 97 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/tests.yml
@@ -18,7 +18,7 @@ jobs:
runs-on: ${{ matrix.os }}
strategy:
matrix:
os: [ubuntu-latest, windows-latest, macos-latest]
os: [ubuntu-latest, windows-latest, macos-13]

steps:
- uses: actions/checkout@v4
18 changes: 17 additions & 1 deletion CHANGELOG.md
@@ -6,6 +6,21 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),

## [Unreleased]

## [4.2.0] - 2024-05-14

### Added

- A deprecation warning will be issued when deprecated config options are used in the config file or in the model weights file.

### Changed

- Config option `max_iters` has been renamed to `cosine_schedule_period_iters` to better reflect that it controls the number of iterations for the cosine half period of the learning rate.

### Fixed

- Fix beam search caching failure when multiple beams have an equal predicted peptide score by breaking ties randomly.
- The mzTab output file now has proper line endings regardless of platform, fixing the extra `\r` found when run on Windows.

## [4.1.0] - 2024-02-16

### Changed
@@ -233,7 +248,8 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),

- Initial Casanovo version.

[Unreleased]: https://github.com/Noble-Lab/casanovo/compare/v4.1.0...HEAD
[Unreleased]: https://github.com/Noble-Lab/casanovo/compare/v4.2.0...HEAD
[4.2.0]: https://github.com/Noble-Lab/casanovo/compare/v4.1.0...v4.2.0
[4.1.0]: https://github.com/Noble-Lab/casanovo/compare/v4.0.1...v4.1.0
[4.0.1]: https://github.com/Noble-Lab/casanovo/compare/v4.0.0...v4.0.1
[4.0.0]: https://github.com/Noble-Lab/casanovo/compare/v3.5.0...v4.0.0
20 changes: 19 additions & 1 deletion casanovo/config.py
@@ -2,6 +2,7 @@

import logging
import shutil
import warnings
from pathlib import Path
from typing import Optional, Dict, Callable, Tuple, Union

@@ -12,6 +13,14 @@
logger = logging.getLogger("casanovo")


# FIXME: This contains deprecated config options to be removed in the next major
# version update.
_config_deprecated = dict(
every_n_train_steps="val_check_interval",
max_iters="cosine_schedule_period_iters",
)


class Config:
"""The Casanovo configuration options.
@@ -56,7 +65,7 @@ class Config:
tb_summarywriter=str,
train_label_smoothing=float,
warmup_iters=int,
max_iters=int,
cosine_schedule_period_iters=int,
learning_rate=float,
weight_decay=float,
train_batch_size=int,
@@ -84,6 +93,15 @@ def __init__(self, config_file: Optional[str] = None):
else:
with Path(config_file).open() as f_in:
self._user_config = yaml.safe_load(f_in)
# Remap deprecated config entries.
for old, new in _config_deprecated.items():
if old in self._user_config:
self._user_config[new] = self._user_config.pop(old)
warnings.warn(
f"Deprecated config option '{old}' remapped to "
f"'{new}'",
DeprecationWarning,
)
# Check for missing entries in config file.
config_missing = self._params.keys() - self._user_config.keys()
if len(config_missing) > 0:
89 changes: 44 additions & 45 deletions casanovo/config.yaml
@@ -4,103 +4,102 @@
###

###
# The following parameters can be modified when running inference or
# when fine-tuning an existing Casanovo model.
# The following parameters can be modified when running inference or when
# fine-tuning an existing Casanovo model.
###

# Max absolute difference allowed with respect to observed precursor m/z
# Max absolute difference allowed with respect to observed precursor m/z.
# Predictions outside the tolerance range are assigned a negative peptide score.
precursor_mass_tol: 50 # ppm
# Isotopes to consider when comparing predicted and observed precursor m/z's
# Isotopes to consider when comparing predicted and observed precursor m/z's.
isotope_error_range: [0, 1]
# The minimum length of predicted peptides
# The minimum length of predicted peptides.
min_peptide_len: 6
# Number of spectra in one inference batch
# Number of spectra in one inference batch.
predict_batch_size: 1024
# Number of beams used in beam search
# Number of beams used in beam search.
n_beams: 1
# Number of PSMs for each spectrum
# Number of PSMs for each spectrum.
top_match: 1
# The hardware accelerator to use. Must be one of:
# "cpu", "gpu", "tpu", "ipu", "hpu", "mps", or "auto"
# "cpu", "gpu", "tpu", "ipu", "hpu", "mps", or "auto".
accelerator: "auto"
# The devices to use. Can be set to a positive number int,
# or the value -1 to indicate all available devices should be used,
# If left empty, the appropriate number will be automatically
# selected for automatic selected on the chosen accelerator.
# The devices to use. Can be set to a positive number int, or the value -1 to
# indicate all available devices should be used. If left empty, the appropriate
# number will be automatically selected based on the chosen accelerator.
devices:

###
# The following parameters should only be modified if you are training a new
# Casanovo model from scratch.
###

# Random seed to ensure reproducible results
# Random seed to ensure reproducible results.
random_seed: 454

# OUTPUT OPTIONS
# Logging frequency in training steps
# Logging frequency in training steps.
n_log: 1
# Tensorboard directory to use for keeping track of training metrics
# Tensorboard directory to use for keeping track of training metrics.
tb_summarywriter:
# Save the top k model checkpoints during training. -1 saves all, and
# leaving this field empty saves none.
# Save the top k model checkpoints during training. -1 saves all, and leaving
# this field empty saves none.
save_top_k: 5
# Path to saved checkpoints
# Path to saved checkpoints.
model_save_folder_path: ""
# Model validation and checkpointing frequency in training steps
# Model validation and checkpointing frequency in training steps.
val_check_interval: 50_000

# SPECTRUM PROCESSING OPTIONS
# Number of the most intense peaks to retain, any remaining peaks are discarded
# Number of the most intense peaks to retain, any remaining peaks are discarded.
n_peaks: 150
# Min peak m/z allowed, peaks with smaller m/z are discarded
# Min peak m/z allowed, peaks with smaller m/z are discarded.
min_mz: 50.0
# Max peak m/z allowed, peaks with larger m/z are discarded
# Max peak m/z allowed, peaks with larger m/z are discarded.
max_mz: 2500.0
# Min peak intensity allowed, less intense peaks are discarded
# Min peak intensity allowed, less intense peaks are discarded.
min_intensity: 0.01
# Max absolute m/z difference allowed when removing the precursor peak
# Max absolute m/z difference allowed when removing the precursor peak.
remove_precursor_tol: 2.0 # Da
# Max precursor charge allowed, spectra with larger charge are skipped
# Max precursor charge allowed, spectra with larger charge are skipped.
max_charge: 10

# MODEL ARCHITECTURE OPTIONS
# Dimensionality of latent representations, i.e. peak embeddings
# Dimensionality of latent representations, i.e. peak embeddings.
dim_model: 512
# Number of attention heads
# Number of attention heads.
n_head: 8
# Dimensionality of fully connected layers
# Dimensionality of fully connected layers.
dim_feedforward: 1024
# Number of transformer layers in spectrum encoder and peptide decoder
# Number of transformer layers in spectrum encoder and peptide decoder.
n_layers: 9
# Dropout rate for model weights
# Dropout rate for model weights.
dropout: 0.0
# Number of dimensions to use for encoding peak intensity
# Projected up to ``dim_model`` by default and summed with the peak m/z encoding
# Number of dimensions to use for encoding peak intensity.
# Projected up to `dim_model` by default and summed with the peak m/z encoding.
dim_intensity:
# Max decoded peptide length
# Max decoded peptide length.
max_length: 100
# Number of warmup iterations for learning rate scheduler
# The number of iterations for the linear warm-up of the learning rate.
warmup_iters: 100_000
# Max number of iterations for learning rate scheduler
max_iters: 600_000
# Learning rate for weight updates during training
# The number of iterations for the cosine half period of the learning rate.
cosine_schedule_period_iters: 600_000
# Learning rate for weight updates during training.
learning_rate: 5e-4
# Regularization term for weight updates
# Regularization term for weight updates.
weight_decay: 1e-5
# Amount of label smoothing when computing the training loss
# Amount of label smoothing when computing the training loss.
train_label_smoothing: 0.01

# TRAINING/INFERENCE OPTIONS
# Number of spectra in one training batch
# Number of spectra in one training batch.
train_batch_size: 32
# Max number of training epochs
# Max number of training epochs.
max_epochs: 30
# Number of validation steps to run before training begins
# Number of validation steps to run before training begins.
num_sanity_val_steps: 0
# Calculate peptide and amino acid precision during training. this
# is expensive, so we recommend against it.
# Calculate peptide and amino acid precision during training.
# This is expensive, so we recommend against it.
calculate_precision: False

# AMINO ACID AND MODIFICATION VOCABULARY
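
For context on the renamed option documented above: the sketch below illustrates one common interpretation of a linear warm-up followed by a cosine half period — an assumption about the schedule's shape, not the exact Casanovo scheduler.

```python
import math

def lr_factor(step: int,
              warmup_iters: int = 100_000,
              cosine_schedule_period_iters: int = 600_000) -> float:
    """Multiplicative learning-rate factor: linear warm-up, then cosine decay."""
    factor = 0.5 * (1.0 + math.cos(math.pi * step / cosine_schedule_period_iters))
    if step < warmup_iters:
        factor *= step / warmup_iters
    return factor

# The factor is largest at the end of warm-up and decays to ~0 when
# step == cosine_schedule_period_iters.
assert lr_factor(0) == 0.0
assert abs(lr_factor(600_000)) < 1e-9
```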
2 changes: 1 addition & 1 deletion casanovo/data/ms_io.py
@@ -147,7 +147,7 @@ def save(self) -> None:
"""
Export the spectrum identifications to the mzTab file.
"""
with open(self.filename, "w") as f:
with open(self.filename, "w", newline="") as f:
writer = csv.writer(f, delimiter="\t", lineterminator=os.linesep)
# Write metadata.
for row in self.metadata: