Prepare release v4.2.0 (#331)
* Remove `train_from_scratch` config option (#275)

Instead of having to specify `train_from_scratch` in the config file, training will proceed from an existing model weights file if this is given as an argument to `casanovo train`.

Fixes #263.

* Stabilize torch.topk() behavior (#290)

* Add epsilon to index zero

* Fix typo

* Use base PyTorch for repeating along the vocabulary size

* Combine masking steps

* Lint with updated black version

* Lint test files

* Add topk unit test

* Fix lint

* Add fixme comment for future

* Update changelog

* Generate new screengrabs with rich-codex

---------

Co-authored-by: Wout Bittremieux <[email protected]>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
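
To make the torch.topk() stabilization above concrete: `torch.topk()` gives no guarantee about which index wins when several entries share exactly the same score, so tied scores can produce different results across runs and devices. A minimal sketch of the epsilon idea named in the bullets — not the actual Casanovo code, and assuming scores of roughly unit magnitude so the epsilon is not absorbed:

```python
import torch

def topk_prefer_first(scores: torch.Tensor, k: int):
    """Top-k values/indices that deterministically prefer index 0 on ties.

    Nudging index 0 by the smallest representable step makes ties resolve the
    same way on every device, without reordering scores whose gaps are larger
    than the epsilon.
    """
    nudged = scores.clone()
    nudged[..., 0] += torch.finfo(scores.dtype).eps
    return torch.topk(nudged, k, dim=-1)

# Both candidates score 0.5; index 0 now wins the tie reproducibly.
values, indices = topk_prefer_first(torch.tensor([0.5, 0.5, 0.1]), k=1)
```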

* Rename max_iters to cosine_schedule_period_iters (#300)

* Rename max_iters to cosine_schedule_period_iters

* Add deprecated config option unit test

* Fix missed rename

* Proper linting

* Remove unnecessary logging

* Test that checkpoints with deprecated config options can be loaded

* Minor change

* Add test for fine-tuning with deprecated config options

* Remove deprecated hyperparameters during model loading

* Include deprecated hyperparameter warning

* Test whether the warning is issued

* Verify that the deprecated option is removed

* Fix comments

* Avoid defining deprecated options twice

* Remap previously renamed config option `every_n_train_steps`

* Update changelog

---------

Co-authored-by: melihyilmaz <[email protected]>
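
In practice, the deprecation handling added in this PR means an old-style config file still loads, with deprecated keys moved to their new names and a warning issued. A standalone sketch of that remapping step, using the same names that appear in the `casanovo/config.py` diff further down this page:

```python
import warnings

# Mapping of deprecated option names to their replacements (as in the diff below).
_config_deprecated = dict(
    every_n_train_steps="val_check_interval",
    max_iters="cosine_schedule_period_iters",
)

def remap_deprecated(user_config: dict) -> dict:
    """Move deprecated keys to their new names and warn about each remap."""
    for old, new in _config_deprecated.items():
        if old in user_config:
            user_config[new] = user_config.pop(old)
            warnings.warn(
                f"Deprecated config option '{old}' remapped to '{new}'",
                DeprecationWarning,
            )
    return user_config

# An old-style config keeps working, now under the new key.
cfg = remap_deprecated({"max_iters": 600_000})
assert cfg == {"cosine_schedule_period_iters": 600_000}
```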

* Add FAQ entry about antibody sequencing

* Don't crash when multiple beams have identical peptide scores (#306)

* Test different beams with identical scores

* Randomly break ties for beams with identical peptide score

* Update changelog

* Don't remove unit test
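
The failure fixed here occurs when finished beams are cached by peptide score and two entries tie exactly, so the comparison falls through to payloads that may not be comparable. A hedged sketch of the random tie-breaking idea — the heap-based cache and names below are illustrative assumptions, not the actual Casanovo beam-search code:

```python
import heapq
import random

def cache_prediction(cache: list, score: float, peptide, top_match: int = 5) -> None:
    """Keep the `top_match` best-scoring predictions, breaking score ties randomly.

    The random number sits between the score and the payload in the heap entry,
    so heapq never has to compare two payloads when scores are identical.
    """
    entry = (score, random.random(), peptide)
    if len(cache) < top_match:
        heapq.heappush(cache, entry)
    else:
        heapq.heappushpop(cache, entry)

# Two beams with identical peptide scores no longer trip up the comparison.
cache: list = []
cache_prediction(cache, 0.87, {"seq": "PEPTIDEK"})
cache_prediction(cache, 0.87, {"seq": "PEPTLDEK"})
```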

* Allow csv to handle all newlines (#316)

* Add 9-species model weights link to FAQ (#303)

* Add model weights link

* Generate new screengrabs with rich-codex

* Clarify that these weights should only be used for benchmarking

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Wout Bittremieux <[email protected]>

* Add FAQ entry about antibody sequencing (#304)

* Add FAQ entry about antibody sequencing

* Generate new screengrabs with rich-codex

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Melih Yilmaz <[email protected]>

* Allow csv to handle all newlines

The `csv` module handles line endings itself. On Windows, the default text-mode file translates the trailing `\n` written by the csv writer into `\r\n` again, which leads to line endings of `\r\r\n` instead of `\r\n`.

Setting `newline=''` produces the intended output on both platforms.
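
A minimal sketch of the resulting call (the file name and row contents are illustrative), mirroring the one-line change to `casanovo/data/ms_io.py` shown further down this page:

```python
import csv
import os

# With newline="" the writer's explicit os.linesep is written verbatim; without
# it, Windows text mode would translate the trailing "\n" again, giving "\r\r\n".
with open("results.mztab", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t", lineterminator=os.linesep)
    writer.writerow(["PSH", "sequence", "search_engine_score[1]"])
```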

* Update CHANGELOG.md

* Fix linting issue

* Delete docs/images/help.svg

---------

Co-authored-by: Melih Yilmaz <[email protected]>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Wout Bittremieux <[email protected]>
Co-authored-by: William Stafford Noble <[email protected]>
Co-authored-by: Wout Bittremieux <[email protected]>

* Don't test on macOS versions with MPS (#327)

* Prepare for release v4.2.0

* Update CHANGELOG.md (#332)

---------

Co-authored-by: Melih Yilmaz <[email protected]>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: melihyilmaz <[email protected]>
Co-authored-by: wsnoble <[email protected]>
Co-authored-by: Joshua Klein <[email protected]>
6 people authored May 14, 2024
1 parent 6dc301c commit 1fcec6a
Showing 11 changed files with 231 additions and 97 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/tests.yml
@@ -18,7 +18,7 @@ jobs:
runs-on: ${{ matrix.os }}
strategy:
matrix:
os: [ubuntu-latest, windows-latest, macos-latest]
os: [ubuntu-latest, windows-latest, macos-13]

steps:
- uses: actions/checkout@v4
18 changes: 17 additions & 1 deletion CHANGELOG.md
@@ -6,6 +6,21 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),

## [Unreleased]

## [4.2.0] - 2024-05-14

### Added

- A deprecation warning will be issued when deprecated config options are used in the config file or in the model weights file.

### Changed

- Config option `max_iters` has been renamed to `cosine_schedule_period_iters` to better reflect that it controls the number of iterations for the cosine half period of the learning rate.

### Fixed

- Fix beam search caching failure when multiple beams have an equal predicted peptide score by breaking ties randomly.
- The mzTab output file now has proper line endings regardless of platform, fixing the extra `\r` found when run on Windows.

## [4.1.0] - 2024-02-16

### Changed
@@ -233,7 +248,8 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),

- Initial Casanovo version.

[Unreleased]: https://github.com/Noble-Lab/casanovo/compare/v4.1.0...HEAD
[Unreleased]: https://github.com/Noble-Lab/casanovo/compare/v4.2.0...HEAD
[4.2.0]: https://github.com/Noble-Lab/casanovo/compare/v4.1.0...v4.2.0
[4.1.0]: https://github.com/Noble-Lab/casanovo/compare/v4.0.1...v4.1.0
[4.0.1]: https://github.com/Noble-Lab/casanovo/compare/v4.0.0...v4.0.1
[4.0.0]: https://github.com/Noble-Lab/casanovo/compare/v3.5.0...v4.0.0
20 changes: 19 additions & 1 deletion casanovo/config.py
@@ -2,6 +2,7 @@

import logging
import shutil
import warnings
from pathlib import Path
from typing import Optional, Dict, Callable, Tuple, Union

@@ -12,6 +13,14 @@
logger = logging.getLogger("casanovo")


# FIXME: This contains deprecated config options to be removed in the next major
# version update.
_config_deprecated = dict(
every_n_train_steps="val_check_interval",
max_iters="cosine_schedule_period_iters",
)


class Config:
"""The Casanovo configuration options.
@@ -56,7 +65,7 @@ class Config:
tb_summarywriter=str,
train_label_smoothing=float,
warmup_iters=int,
max_iters=int,
cosine_schedule_period_iters=int,
learning_rate=float,
weight_decay=float,
train_batch_size=int,
@@ -84,6 +93,15 @@ def __init__(self, config_file: Optional[str] = None):
else:
with Path(config_file).open() as f_in:
self._user_config = yaml.safe_load(f_in)
# Remap deprecated config entries.
for old, new in _config_deprecated.items():
if old in self._user_config:
self._user_config[new] = self._user_config.pop(old)
warnings.warn(
f"Deprecated config option '{old}' remapped to "
f"'{new}'",
DeprecationWarning,
)
# Check for missing entries in config file.
config_missing = self._params.keys() - self._user_config.keys()
if len(config_missing) > 0:
89 changes: 44 additions & 45 deletions casanovo/config.yaml
@@ -4,103 +4,102 @@
###

###
# The following parameters can be modified when running inference or
# when fine-tuning an existing Casanovo model.
# The following parameters can be modified when running inference or when
# fine-tuning an existing Casanovo model.
###

# Max absolute difference allowed with respect to observed precursor m/z
# Max absolute difference allowed with respect to observed precursor m/z.
# Predictions outside the tolerance range are assigned a negative peptide score.
precursor_mass_tol: 50 # ppm
# Isotopes to consider when comparing predicted and observed precursor m/z's
# Isotopes to consider when comparing predicted and observed precursor m/z's.
isotope_error_range: [0, 1]
# The minimum length of predicted peptides
# The minimum length of predicted peptides.
min_peptide_len: 6
# Number of spectra in one inference batch
# Number of spectra in one inference batch.
predict_batch_size: 1024
# Number of beams used in beam search
# Number of beams used in beam search.
n_beams: 1
# Number of PSMs for each spectrum
# Number of PSMs for each spectrum.
top_match: 1
# The hardware accelerator to use. Must be one of:
# "cpu", "gpu", "tpu", "ipu", "hpu", "mps", or "auto"
# "cpu", "gpu", "tpu", "ipu", "hpu", "mps", or "auto".
accelerator: "auto"
# The devices to use. Can be set to a positive number int,
# or the value -1 to indicate all available devices should be used,
# If left empty, the appropriate number will be automatically
# selected for automatic selected on the chosen accelerator.
# The devices to use. Can be set to a positive number int, or the value -1 to
# indicate all available devices should be used. If left empty, the appropriate
# number will be automatically selected based on the chosen accelerator.
devices:

###
# The following parameters should only be modified if you are training a new
# Casanovo model from scratch.
###

# Random seed to ensure reproducible results
# Random seed to ensure reproducible results.
random_seed: 454

# OUTPUT OPTIONS
# Logging frequency in training steps
# Logging frequency in training steps.
n_log: 1
# Tensorboard directory to use for keeping track of training metrics
# Tensorboard directory to use for keeping track of training metrics.
tb_summarywriter:
# Save the top k model checkpoints during training. -1 saves all, and
# leaving this field empty saves none.
# Save the top k model checkpoints during training. -1 saves all, and leaving
# this field empty saves none.
save_top_k: 5
# Path to saved checkpoints
# Path to saved checkpoints.
model_save_folder_path: ""
# Model validation and checkpointing frequency in training steps
# Model validation and checkpointing frequency in training steps.
val_check_interval: 50_000

# SPECTRUM PROCESSING OPTIONS
# Number of the most intense peaks to retain, any remaining peaks are discarded
# Number of the most intense peaks to retain, any remaining peaks are discarded.
n_peaks: 150
# Min peak m/z allowed, peaks with smaller m/z are discarded
# Min peak m/z allowed, peaks with smaller m/z are discarded.
min_mz: 50.0
# Max peak m/z allowed, peaks with larger m/z are discarded
# Max peak m/z allowed, peaks with larger m/z are discarded.
max_mz: 2500.0
# Min peak intensity allowed, less intense peaks are discarded
# Min peak intensity allowed, less intense peaks are discarded.
min_intensity: 0.01
# Max absolute m/z difference allowed when removing the precursor peak
# Max absolute m/z difference allowed when removing the precursor peak.
remove_precursor_tol: 2.0 # Da
# Max precursor charge allowed, spectra with larger charge are skipped
# Max precursor charge allowed, spectra with larger charge are skipped.
max_charge: 10

# MODEL ARCHITECTURE OPTIONS
# Dimensionality of latent representations, i.e. peak embeddings
# Dimensionality of latent representations, i.e. peak embeddings.
dim_model: 512
# Number of attention heads
# Number of attention heads.
n_head: 8
# Dimensionality of fully connected layers
# Dimensionality of fully connected layers.
dim_feedforward: 1024
# Number of transformer layers in spectrum encoder and peptide decoder
# Number of transformer layers in spectrum encoder and peptide decoder.
n_layers: 9
# Dropout rate for model weights
# Dropout rate for model weights.
dropout: 0.0
# Number of dimensions to use for encoding peak intensity
# Projected up to ``dim_model`` by default and summed with the peak m/z encoding
# Number of dimensions to use for encoding peak intensity.
# Projected up to `dim_model` by default and summed with the peak m/z encoding.
dim_intensity:
# Max decoded peptide length
# Max decoded peptide length.
max_length: 100
# Number of warmup iterations for learning rate scheduler
# The number of iterations for the linear warm-up of the learning rate.
warmup_iters: 100_000
# Max number of iterations for learning rate scheduler
max_iters: 600_000
# Learning rate for weight updates during training
# The number of iterations for the cosine half period of the learning rate.
cosine_schedule_period_iters: 600_000
# Learning rate for weight updates during training.
learning_rate: 5e-4
# Regularization term for weight updates
# Regularization term for weight updates.
weight_decay: 1e-5
# Amount of label smoothing when computing the training loss
# Amount of label smoothing when computing the training loss.
train_label_smoothing: 0.01

# TRAINING/INFERENCE OPTIONS
# Number of spectra in one training batch
# Number of spectra in one training batch.
train_batch_size: 32
# Max number of training epochs
# Max number of training epochs.
max_epochs: 30
# Number of validation steps to run before training begins
# Number of validation steps to run before training begins.
num_sanity_val_steps: 0
# Calculate peptide and amino acid precision during training. this
# is expensive, so we recommend against it.
# Calculate peptide and amino acid precision during training.
# This is expensive, so we recommend against it.
calculate_precision: False

# AMINO ACID AND MODIFICATION VOCABULARY
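
For context on the renamed option documented above: the sketch below illustrates one common interpretation of a linear warm-up followed by a cosine half period — an assumption about the schedule's shape, not the exact Casanovo scheduler.

```python
import math

def lr_factor(step: int,
              warmup_iters: int = 100_000,
              cosine_schedule_period_iters: int = 600_000) -> float:
    """Multiplicative learning-rate factor: linear warm-up, then cosine decay."""
    factor = 0.5 * (1.0 + math.cos(math.pi * step / cosine_schedule_period_iters))
    if step < warmup_iters:
        factor *= step / warmup_iters
    return factor

# The factor is largest at the end of warm-up and decays to ~0 when
# step == cosine_schedule_period_iters.
assert lr_factor(0) == 0.0
assert abs(lr_factor(600_000)) < 1e-9
```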
2 changes: 1 addition & 1 deletion casanovo/data/ms_io.py
@@ -147,7 +147,7 @@ def save(self) -> None:
"""
Export the spectrum identifications to the mzTab file.
"""
with open(self.filename, "w") as f:
with open(self.filename, "w", newline="") as f:
writer = csv.writer(f, delimiter="\t", lineterminator=os.linesep)
# Write metadata.
for row in self.metadata: