-
-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: onnx models #10
base: dev
Are you sure you want to change the base?
Conversation
WalkthroughThis pull request introduces significant enhancements to the text-to-speech (TTS) plugin for NOS voices, transitioning to a custom ONNX-based inference system. Key modifications include the creation of a new VITS ONNX inference module, updates to the Changes
Possibly related PRs
Poem
Finishing Touches
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 5
🧹 Nitpick comments (8)
ovos_tts_plugin_nos/__init__.py (4)
5-5
: Unused Import: RemoveDict
fromtyping
The import statement
from typing import Dict
addsDict
, but there is no usage ofDict
in the code. Removing unused imports can clean up the code and avoid confusion.Apply this change:
-from typing import Dict
28-29
: Redundant Comment: Remove or ExpandThe comment
# pre-download voices on init if needed
is redundant because the codeself.get_engine(self.voice)
is self-explanatory. If additional context is necessary, consider expanding the comment; otherwise, it can be removed.
105-111
: Rename Variabletts
for ClarityIn the
get_tts
method, the variabletts
is used to store the engine instance. For clarity, consider renaming it toengine
orinference_engine
to more accurately reflect its purpose.Apply this change:
- tts = self.get_engine(voice) - tts.synth(sentence, wav_file) + engine = self.get_engine(voice) + engine.synth(sentence, wav_file)
125-131
: Manage Memory Usage for Cached EnginesCaching
VitsOnnxInference
instances for each voice can lead to increased memory usage if many voices are supported. Consider:
- Implementing a cache eviction policy.
- Limiting the number of cached instances.
- Allowing for manual release of resources when an engine is no longer needed.
ovos_tts_plugin_nos/vits_onnx.py (4)
1-4
: Add Module-Level DocstringThe module lacks a docstring that describes its purpose. Adding a module-level docstring improves code readability and provides context for other developers.
Apply this change:
# minimal onnx inference extracted from coqui-tts """ This module provides ONNX-based text-to-speech inference using the VITS model. It includes classes for managing graphemes, tokenizing text, and performing inference. """
58-65
: Optimize_create_vocab
CallsThe
_create_vocab
method is called during initialization and every time a property setter is used. To optimize:
- Check if properties are being modified after initialization. If not, the multiple calls to
_create_vocab
are unnecessary.- If properties might change, consider lazy evaluation or tracking changes to rebuild the vocabulary only when necessary.
257-266
: Clarify Execution Provider Selection LogicThe current logic for setting execution providers can be simplified for clarity:
- Explicitly set providers based on the
cuda
flag.- Remove unnecessary ternary operations for better readability.
Apply this change:
if cuda: - providers = [ - ("CUDAExecutionProvider", {"cudnn_conv_algo_search": "DEFAULT"}) - ] + providers = ["CUDAExecutionProvider"] else: providers = ["CPUExecutionProvider"]
290-300
: Enhance Text NormalizationThe
normalize_text
method performs basic text normalization. To improve speech synthesis quality, consider:
- Handling special characters and diacritics.
- Expanding numbers, dates, and abbreviations into words.
- Removing or substituting unsupported characters.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
ovos_tts_plugin_nos/__init__.py
(4 hunks)ovos_tts_plugin_nos/vits_onnx.py
(1 hunks)requirements.txt
(1 hunks)
🔇 Additional comments (4)
ovos_tts_plugin_nos/__init__.py (2)
17-19
: Consistent Naming Convention for ConstantsThe constants
CELTIA
andSABELA
are defined in uppercase, which follows the convention for constants. Ensure that all constants in the codebase use consistent naming for readability and maintainability.
110-111
:⚠️ Potential issueEnsure Thread Safety for
VOICE2ENGINE
DictionaryThe class variable
VOICE2ENGINE
is modified without synchronization, which can lead to race conditions in multi-threaded environments. To make it thread-safe:
- Use a threading lock when accessing or modifying
VOICE2ENGINE
.- Alternatively, use thread-safe data structures or move
VOICE2ENGINE
to an instance variable if appropriate.To verify this issue, review how
NosTTSPlugin
instances are created and used in multi-threaded contexts within the application.ovos_tts_plugin_nos/vits_onnx.py (1)
256-273
: Review the Stability of Minimal ONNX InferenceThe comment indicates that this is a minimal inference code extracted from Coqui TTS. Ensure that:
- The implementation is stable and handles all necessary edge cases.
- Adequate testing is performed to validate performance and correctness.
- Documentation is updated to reflect production readiness.
requirements.txt (1)
4-4
: Verify Necessity ofovos-tts-plugin-cotovia
With the shift to ONNX models, confirm if the dependency
ovos-tts-plugin-cotovia
is still required. Removing unnecessary dependencies simplifies maintenance and reduces potential conflicts.To verify, check if
CotoviaTTSPlugin
or related functionality is still utilized in the codebase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
ovos_tts_plugin_nos/vits_onnx.py (2)
41-41
: Fix typo in class docstringThere appears to be an incomplete or misspelled word 'el' at line 41 in the
Graphemes
class docstring. This may cause confusion and should be corrected for clarity.
254-263
: Clarify the providers configuration for ONNX Runtime sessionIn the
VitsOnnxInference
class constructor (lines 254-263), theproviders
list is constructed using a conditional expression within a list. The current implementation may lead to readability issues.Consider refactoring the
providers
assignment for better clarity:- providers = [ - "CPUExecutionProvider" - if cuda is False - else ("CUDAExecutionProvider", {"cudnn_conv_algo_search": "DEFAULT"}) - ] + if cuda is False: + providers = ["CPUExecutionProvider"] + else: + providers = [("CUDAExecutionProvider", {"cudnn_conv_algo_search": "DEFAULT"})]This change improves readability and makes the code easier to maintain.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
ovos_tts_plugin_nos/vits_onnx.py
(1 hunks)
🔇 Additional comments (2)
ovos_tts_plugin_nos/vits_onnx.py (2)
16-17
:⚠️ Potential issueEnsure consistency between vocabulary order in documentation and implementation
The class docstring (lines 16-17) states that characters are ordered as
[PAD, EOS, BOS, BLANK, CHARACTERS, PUNCTUATIONS]
. However, in the_create_vocab
method (line 154), the order is[PAD, PUNCTUATIONS, CHARACTERS, BLANK]
, and theEOS
andBOS
tokens are not included in the vocabulary.This inconsistency may lead to incorrect token mappings and affect the tokenizer's behavior.
Consider updating the
_create_vocab
method to include theEOS
andBOS
tokens in the vocabulary if they are used:def _create_vocab(self): - self._vocab = [self._pad] + list(self._punctuations) + list(self._characters) + [self._blank] + self._vocab = [self._pad] + if self._eos: + self._vocab.append(self._eos) + if self._bos: + self._vocab.append(self._bos) + self._vocab += [self._blank] + list(self._characters) + list(self._punctuations)Alternatively, if
EOS
andBOS
are not required, please update the docstring to reflect the actual order used.As per our previous discussions, I understand that the code is kept in sync with coqui-TTS upstream, and deviations should be carefully considered.
Also applies to: 154-154
274-275
:⚠️ Potential issueInclude missing letters 'W' and 'w' in the character set
The
_letters
variable in theVitsOnnxInference
class (lines 274-275) is missing the letters 'W' and 'w'. This omission can lead to issues when processing text containing these letters, potentially causing the tokenizer to discard them or raise errors.Please verify whether the omission of 'W' and 'w' is intentional in the upstream code from coqui-TTS. If not, include these letters to ensure that the model can handle all standard English letters.
Apply this diff to include 'W' and 'w':
_letters = self.config.get("characters", {}).get("characters", - "ABCDEFGHIJKLMNOPQRSTUVXYZabcdefghijklmnopqrstuvwxyz\u00c1\u00c9\u00cd\u00d3\u00da\u00e1\u00e9\u00ed\u00f1\u00f3\u00fa\u00fc") + "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\u00c1\u00c9\u00cd\u00d3\u00da\u00e1\u00e9\u00ed\u00f1\u00f3\u00fa\u00fc")As per our learnings, the code is intentionally kept in sync with coqui-TTS upstream implementation, so please consider this when making changes.
Docstrings generation was requested by @JarbasAl. * #10 (comment) The following files were modified: * `ovos_tts_plugin_nos/__init__.py` * `ovos_tts_plugin_nos/vits_onnx.py`
Note We have generated docstrings for this pull request, at #12 |
Docstrings generation was requested by @JarbasAl. * #10 (comment) The following files were modified: * `ovos_tts_plugin_nos/__init__.py` * `ovos_tts_plugin_nos/vits_onnx.py` Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
ovos_tts_plugin_nos/__init__.py
(3 hunks)ovos_tts_plugin_nos/vits_onnx.py
(1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
ovos_tts_plugin_nos/vits_onnx.py
84-84: SyntaxError: Unexpected indentation
404-404: SyntaxError: Expected a statement
440-440: SyntaxError: Unexpected indentation
577-577: SyntaxError: Expected a statement
🔇 Additional comments (7)
ovos_tts_plugin_nos/__init__.py (5)
5-10
: LGTM! Well-structured initialization with type safety.The imports and class variables are well-organized, with proper type hints for the engine cache dictionary.
Also applies to: 17-19
22-42
: LGTM! Well-documented constructor with proactive voice initialization.The constructor is well-documented with comprehensive docstrings and includes proactive voice downloading for better user experience.
63-63
: Avoid Usingassert
for Input ValidationUsing
assert
statements for input validation can be bypassed in optimized mode (python -O
). It's safer to raise an exception to handle invalid inputs explicitly.Apply this change:
-assert voice in ["celtia", "sabela"] +if voice not in ["celtia", "sabela"]: + raise ValueError(f"Unsupported voice: {voice}")
Line range hint
141-175
: LGTM! Comprehensive text preprocessing with clear documentation.The get_tts method includes thorough documentation and handles special cases in text preprocessing effectively.
189-211
: LGTM! Efficient engine caching with proper initialization.The get_engine method implements an efficient caching mechanism for inference engines, improving performance by reusing initialized engines.
ovos_tts_plugin_nos/vits_onnx.py (2)
404-575
: LGTM! Robust tokenization with proper error handling.The TTSTokenizer class implements comprehensive text tokenization with proper handling of out-of-vocabulary characters and special tokens.
🧰 Tools
🪛 Ruff (0.8.2)
404-404: SyntaxError: Expected a statement
440-440: SyntaxError: Unexpected indentation
577-731
: LGTM! Well-implemented ONNX inference with proper audio handling.The VitsOnnxInference class provides a robust implementation for ONNX-based text-to-speech synthesis with:
- Flexible configuration management
- Proper audio normalization
- Efficient inference pipeline
🧰 Tools
🪛 Ruff (0.8.2)
577-577: SyntaxError: Expected a statement
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (1)
ovos_tts_plugin_nos/vits_onnx.py (1)
711-713
: Consider making normalization threshold configurable.The audio normalization uses a hardcoded threshold of 0.01, which might not be optimal for all cases.
Apply this diff to make the threshold configurable:
+ @staticmethod + def save_wav(wav: np.ndarray, path: str, sample_rate: int = 16000, norm_threshold: float = 0.01) -> None: - wav_norm = wav * (32767 / max(0.01, np.max(np.abs(wav)))) + wav_norm = wav * (32767 / max(norm_threshold, np.max(np.abs(wav)))) wav_norm = wav_norm.astype(np.int16) scipy.io.wavfile.write(path, sample_rate, wav_norm)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
ovos_tts_plugin_nos/vits_onnx.py
(1 hunks)
🔇 Additional comments (3)
ovos_tts_plugin_nos/vits_onnx.py (3)
1-11
: LGTM! Well-structured imports and constants.The code properly imports required dependencies and uses the project's logging utility.
404-575
: LGTM! Well-implemented tokenization with proper error handling.The TTSTokenizer class provides robust text tokenization with configurable features for blank token insertion and sequence padding. The implementation properly handles out-of-vocabulary characters with appropriate logging.
338-360
:⚠️ Potential issueFix inconsistency between implementation and documentation.
The docstring states that tokens should be ordered as
[PAD, EOS, BOS, BLANK, CHARACTERS, PUNCTUATIONS]
, but the implementation uses a different order:[PAD, PUNCTUATIONS, CHARACTERS, BLANK]
. Additionally, EOS and BOS tokens are missing from the implementation.Apply this diff to fix the implementation:
- self._vocab = [self._pad] + list(self._punctuations) + list(self._characters) + [self._blank] + self._vocab = [self._pad, self._eos, self._bos, self._blank] + list(self._characters) + list(self._punctuations)Likely invalid or redundant comment.
TODO_ generated audio seems to be always 3.12 seconds, long texts end abruptly mid sentence
Summary by CodeRabbit
New Features
Bug Fixes
Chores