
feat: onnx models #10

Draft · wants to merge 4 commits into dev

Conversation

JarbasAl (Member) commented Jan 19, 2025

TODO: generated audio always seems to be 3.12 seconds; long texts end abruptly mid-sentence

Summary by CodeRabbit

  • New Features

    • Added support for VITS ONNX-based text-to-speech inference.
    • Introduced new voice models with improved synthesis capabilities.
    • Implemented advanced text tokenization and audio generation.
    • Enhanced voice downloading and initialization process.
  • Bug Fixes

    • Streamlined voice engine management.
  • Chores

    • Updated project dependencies.
    • Removed previous TTS library dependency.
    • Added support for numpy, scipy, and ONNX runtime.


coderabbitai bot commented Jan 19, 2025

Walkthrough

This pull request introduces significant enhancements to the text-to-speech (TTS) plugin for NOS voices, transitioning to a custom ONNX-based inference system. Key modifications include the creation of a new VITS ONNX inference module, updates to the NosTTSPlugin class for better voice management, and adjustments in the requirements file to incorporate necessary libraries. The changes streamline the voice downloading and initialization process while ensuring compatibility with new model paths.
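For orientation, the intended call flow might look roughly like the sketch below. NosTTSPlugin, VOICE2ENGINE, get_engine and synth are names taken from this PR's summary and diffs; the constructor arguments, voice handling details and output path are illustrative assumptions, not the final API.

# Illustrative sketch of the expected usage, based on names mentioned in
# this PR (NosTTSPlugin, VOICE2ENGINE, get_engine, synth); details may differ.
from ovos_tts_plugin_nos import NosTTSPlugin

plugin = NosTTSPlugin(config={"voice": "celtia"})

# get_engine() downloads the voice model if needed and caches a
# VitsOnnxInference instance in the class-level VOICE2ENGINE dict.
engine = plugin.get_engine("celtia")

# synth() tokenizes the sentence, runs the ONNX session and writes a WAV file.
engine.synth("Ola, isto é unha proba de síntese de voz.", "out.wav")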

Changes

File Changes

  • ovos_tts_plugin_nos/__init__.py: updated import statements, added the VOICE2ENGINE class variable, modified the download method to accept a voice parameter, simplified get_tts, and redefined get_engine.
  • ovos_tts_plugin_nos/vits_onnx.py: added a Graphemes class for vocabulary management, a TTSTokenizer for text encoding, and a VitsOnnxInference class for ONNX-based inference, along with the text-to-speech synthesis methods.
  • requirements.txt: removed the coqui-tts package; added numpy, scipy, and onnxruntime.

Possibly related PRs

  • feat: text normalization #5: The changes in the NosTTSPlugin class in both PRs involve modifications to the __init__ and get_tts methods, indicating a direct relationship in how the class is structured and how it processes text for TTS.

Poem

🐰 A Rabbit's Ode to Voice Synthesis 🎤

ONNX models dance with glee,
Celtia and Sabela sing so free,
From Coqui's realm to inference bright,
Our voices leap with coding might!
Synthesizing words with rabbit delight! 🎶


github-actions bot added and removed the feature label Jan 19, 2025

coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (8)
ovos_tts_plugin_nos/__init__.py (4)

5-5: Unused Import: Remove Dict from typing

The import statement from typing import Dict adds Dict, but there is no usage of Dict in the code. Removing unused imports can clean up the code and avoid confusion.

Apply this change:

-from typing import Dict

28-29: Redundant Comment: Remove or Expand

The comment # pre-download voices on init if needed is redundant because the code self.get_engine(self.voice) is self-explanatory. If additional context is necessary, consider expanding the comment; otherwise, it can be removed.


105-111: Rename Variable tts for Clarity

In the get_tts method, the variable tts is used to store the engine instance. For clarity, consider renaming it to engine or inference_engine to more accurately reflect its purpose.

Apply this change:

-        tts = self.get_engine(voice)
-        tts.synth(sentence, wav_file)
+        engine = self.get_engine(voice)
+        engine.synth(sentence, wav_file)

125-131: Manage Memory Usage for Cached Engines

Caching VitsOnnxInference instances for each voice can lead to increased memory usage if many voices are supported. Consider the following (a minimal sketch follows this list):

  • Implementing a cache eviction policy.
  • Limiting the number of cached instances.
  • Allowing for manual release of resources when an engine is no longer needed.
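As a concrete illustration of the eviction idea, a minimal LRU-style cache could look like the sketch below. The size limit, function name and loader callback are illustrative assumptions, not code from this PR:

# Illustrative LRU-style cache for inference engines (not part of this PR).
from collections import OrderedDict
from typing import Any, Callable

MAX_CACHED_ENGINES = 2  # e.g. "celtia" and "sabela"
_ENGINE_CACHE: "OrderedDict[str, Any]" = OrderedDict()

def get_cached_engine(voice: str, load_engine: Callable[[str], Any]) -> Any:
    """Return a cached engine for `voice`, evicting the least recently used one."""
    if voice in _ENGINE_CACHE:
        _ENGINE_CACHE.move_to_end(voice)        # mark as most recently used
        return _ENGINE_CACHE[voice]
    engine = load_engine(voice)                 # e.g. build a VitsOnnxInference
    _ENGINE_CACHE[voice] = engine
    if len(_ENGINE_CACHE) > MAX_CACHED_ENGINES:
        _ENGINE_CACHE.popitem(last=False)       # drop the least recently used entry
    return engine
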
ovos_tts_plugin_nos/vits_onnx.py (4)

1-4: Add Module-Level Docstring

The module lacks a docstring that describes its purpose. Adding a module-level docstring improves code readability and provides context for other developers.

Apply this change:

# minimal onnx inference extracted from coqui-tts

"""
This module provides ONNX-based text-to-speech inference using the VITS model.
It includes classes for managing graphemes, tokenizing text, and performing inference.
"""

58-65: Optimize _create_vocab Calls

The _create_vocab method is called during initialization and every time a property setter is used. To optimize (one possible approach is sketched after this list):

  • Check if properties are being modified after initialization. If not, the multiple calls to _create_vocab are unnecessary.
  • If properties might change, consider lazy evaluation or tracking changes to rebuild the vocabulary only when necessary.
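One possible shape for the lazy-evaluation option is a dirty flag that invalidates the vocabulary in the setters and rebuilds it on first access. The class and attribute names below are illustrative; only the pattern is the point:

# Illustrative dirty-flag pattern for rebuilding a vocabulary lazily
# (not the Graphemes implementation in this PR).
class LazyVocab:
    def __init__(self, characters: str):
        self._characters = characters
        self._vocab = None            # built on first access

    @property
    def characters(self) -> str:
        return self._characters

    @characters.setter
    def characters(self, value: str) -> None:
        self._characters = value
        self._vocab = None            # invalidate instead of rebuilding eagerly

    @property
    def vocab(self) -> list:
        if self._vocab is None:       # rebuild only when actually needed
            self._vocab = list(self._characters)
        return self._vocab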

257-266: Clarify Execution Provider Selection Logic

The current logic for setting execution providers can be simplified for clarity:

  • Explicitly set providers based on the cuda flag.
  • Remove unnecessary ternary operations for better readability.

Apply this change:

     if cuda:
-        providers = [
-            ("CUDAExecutionProvider", {"cudnn_conv_algo_search": "DEFAULT"})
-        ]
+        providers = ["CUDAExecutionProvider"]
     else:
         providers = ["CPUExecutionProvider"]

290-300: Enhance Text Normalization

The normalize_text method performs basic text normalization. To improve speech synthesis quality, consider the following (an illustrative helper is sketched after this list):

  • Handling special characters and diacritics.
  • Expanding numbers, dates, and abbreviations into words.
  • Removing or substituting unsupported characters.
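A minimal, illustrative helper along these lines is sketched below. The allowed character set roughly mirrors the accented letters used elsewhere in this PR, but the exact set, and the omission of number/date expansion, are assumptions rather than the plugin's actual normalize_text:

# Illustrative normalization sketch (not the plugin's normalize_text).
import re
import unicodedata

_ALLOWED = re.compile(r"[^0-9a-záéíñóúüA-ZÁÉÍÑÓÚÜ\s.,;:!?¡¿'\-]")

def basic_normalize(text: str) -> str:
    text = unicodedata.normalize("NFC", text)   # canonical form for diacritics
    text = _ALLOWED.sub(" ", text)              # drop unsupported characters
    text = re.sub(r"\s+", " ", text).strip()    # collapse whitespace
    # expanding numbers, dates and abbreviations into words would need a
    # language-specific expander and is intentionally not shown here
    return text
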
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d16108d and af68de7.

📒 Files selected for processing (3)
  • ovos_tts_plugin_nos/__init__.py (4 hunks)
  • ovos_tts_plugin_nos/vits_onnx.py (1 hunks)
  • requirements.txt (1 hunks)
🔇 Additional comments (4)
ovos_tts_plugin_nos/__init__.py (2)

17-19: Consistent Naming Convention for Constants

The constants CELTIA and SABELA are defined in uppercase, which follows the convention for constants. Ensure that all constants in the codebase use consistent naming for readability and maintainability.


110-111: ⚠️ Potential issue

Ensure Thread Safety for VOICE2ENGINE Dictionary

The class variable VOICE2ENGINE is modified without synchronization, which can lead to race conditions in multi-threaded environments. To make it thread-safe:

  • Use a threading lock when accessing or modifying VOICE2ENGINE.
  • Alternatively, use thread-safe data structures or move VOICE2ENGINE to an instance variable if appropriate.

To verify this issue, review how NosTTSPlugin instances are created and used in multi-threaded contexts within the application.
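A minimal sketch of the lock-based option is shown below; VOICE2ENGINE and get_engine are named in this PR, while the class name, lock attribute and loader helper are illustrative assumptions:

# Illustrative lock around the class-level engine cache.
import threading

class NosTTSPluginSketch:
    VOICE2ENGINE: dict = {}
    _ENGINE_LOCK = threading.Lock()

    def get_engine(self, voice: str):
        with self._ENGINE_LOCK:                    # serialize cache access
            engine = self.VOICE2ENGINE.get(voice)
            if engine is None:
                engine = self._load_engine(voice)  # hypothetical loader
                self.VOICE2ENGINE[voice] = engine
            return engine

    def _load_engine(self, voice: str):
        raise NotImplementedError("model loading elided in this sketch")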

ovos_tts_plugin_nos/vits_onnx.py (1)

256-273: Review the Stability of Minimal ONNX Inference

The comment indicates that this is a minimal inference code extracted from Coqui TTS. Ensure that:

  • The implementation is stable and handles all necessary edge cases.
  • Adequate testing is performed to validate performance and correctness.
  • Documentation is updated to reflect production readiness.
requirements.txt (1)

4-4: Verify Necessity of ovos-tts-plugin-cotovia

With the shift to ONNX models, confirm if the dependency ovos-tts-plugin-cotovia is still required. Removing unnecessary dependencies simplifies maintenance and reduces potential conflicts.

To verify, check if CotoviaTTSPlugin or related functionality is still utilized in the codebase.
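One quick way to run that check is a small scan over the package sources, as in the hedged sketch below (the package path is an assumption):

# Illustrative scan for remaining references to the cotovia plugin.
import pathlib

for path in pathlib.Path("ovos_tts_plugin_nos").rglob("*.py"):
    for lineno, line in enumerate(path.read_text(encoding="utf-8").splitlines(), 1):
        if "cotovia" in line.lower():
            print(f"{path}:{lineno}: {line.strip()}")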

github-actions bot added and removed the feature label Jan 19, 2025

coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
ovos_tts_plugin_nos/vits_onnx.py (2)

41-41: Fix typo in class docstring

There appears to be an incomplete or misspelled word 'el' at line 41 in the Graphemes class docstring. This may cause confusion and should be corrected for clarity.


254-263: Clarify the providers configuration for ONNX Runtime session

In the VitsOnnxInference class constructor (lines 254-263), the providers list is constructed using a conditional expression within a list. The current implementation may lead to readability issues.

Consider refactoring the providers assignment for better clarity:

-            providers = [
-                "CPUExecutionProvider"
-                if cuda is False
-                else ("CUDAExecutionProvider", {"cudnn_conv_algo_search": "DEFAULT"})
-            ]
+            if cuda is False:
+                providers = ["CPUExecutionProvider"]
+            else:
+                providers = [("CUDAExecutionProvider", {"cudnn_conv_algo_search": "DEFAULT"})]

This change improves readability and makes the code easier to maintain.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between af68de7 and 58e0e0f.

📒 Files selected for processing (1)
  • ovos_tts_plugin_nos/vits_onnx.py (1 hunks)
🔇 Additional comments (2)
ovos_tts_plugin_nos/vits_onnx.py (2)

16-17: ⚠️ Potential issue

Ensure consistency between vocabulary order in documentation and implementation

The class docstring (lines 16-17) states that characters are ordered as [PAD, EOS, BOS, BLANK, CHARACTERS, PUNCTUATIONS]. However, in the _create_vocab method (line 154), the order is [PAD, PUNCTUATIONS, CHARACTERS, BLANK], and the EOS and BOS tokens are not included in the vocabulary.

This inconsistency may lead to incorrect token mappings and affect the tokenizer's behavior.

Consider updating the _create_vocab method to include the EOS and BOS tokens in the vocabulary if they are used:

 def _create_vocab(self):
-    self._vocab = [self._pad] + list(self._punctuations) + list(self._characters) + [self._blank]
+    self._vocab = [self._pad]
+    if self._eos:
+        self._vocab.append(self._eos)
+    if self._bos:
+        self._vocab.append(self._bos)
+    self._vocab += [self._blank] + list(self._characters) + list(self._punctuations)

Alternatively, if EOS and BOS are not required, please update the docstring to reflect the actual order used.

As per our previous discussions, I understand that the code is kept in sync with coqui-TTS upstream, and deviations should be carefully considered.

Also applies to: 154-154


274-275: ⚠️ Potential issue

Include missing letters 'W' and 'w' in the character set

The _letters variable in the VitsOnnxInference class (lines 274-275) is missing the letters 'W' and 'w'. This omission can lead to issues when processing text containing these letters, potentially causing the tokenizer to discard them or raise errors.

Please verify whether the omission of 'W' and 'w' is intentional in the upstream code from coqui-TTS. If not, include these letters to ensure that the model can handle all standard English letters.

Apply this diff to include 'W' and 'w':

             _letters = self.config.get("characters", {}).get("characters",
-                                                             "ABCDEFGHIJKLMNOPQRSTUVXYZabcdefghijklmnopqrstuvwxyz\u00c1\u00c9\u00cd\u00d3\u00da\u00e1\u00e9\u00ed\u00f1\u00f3\u00fa\u00fc")
+                                                             "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\u00c1\u00c9\u00cd\u00d3\u00da\u00e1\u00e9\u00ed\u00f1\u00f3\u00fa\u00fc")

As per our learnings, the code is intentionally kept in sync with coqui-TTS upstream implementation, so please consider this when making changes.

coderabbitai bot added a commit that referenced this pull request Jan 19, 2025
Docstrings generation was requested by @JarbasAl.

* #10 (comment)

The following files were modified:

* `ovos_tts_plugin_nos/__init__.py`
* `ovos_tts_plugin_nos/vits_onnx.py`

coderabbitai bot commented Jan 19, 2025

Note

We have generated docstrings for this pull request, at #12

Docstrings generation was requested by @JarbasAl.

* #10 (comment)

The following files were modified:

* `ovos_tts_plugin_nos/__init__.py`
* `ovos_tts_plugin_nos/vits_onnx.py`

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
github-actions bot added and removed the feature label Jan 19, 2025

coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 58e0e0f and f7f5bdc.

📒 Files selected for processing (2)
  • ovos_tts_plugin_nos/__init__.py (3 hunks)
  • ovos_tts_plugin_nos/vits_onnx.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
ovos_tts_plugin_nos/vits_onnx.py

84-84: SyntaxError: Unexpected indentation


404-404: SyntaxError: Expected a statement


440-440: SyntaxError: Unexpected indentation


577-577: SyntaxError: Expected a statement

🔇 Additional comments (7)
ovos_tts_plugin_nos/__init__.py (5)

5-10: LGTM! Well-structured initialization with type safety.

The imports and class variables are well-organized, with proper type hints for the engine cache dictionary.

Also applies to: 17-19


22-42: LGTM! Well-documented constructor with proactive voice initialization.

The constructor is well-documented with comprehensive docstrings and includes proactive voice downloading for better user experience.


63-63: Avoid Using assert for Input Validation

Using assert statements for input validation can be bypassed in optimized mode (python -O). It's safer to raise an exception to handle invalid inputs explicitly.

Apply this change:

-assert voice in ["celtia", "sabela"]
+if voice not in ["celtia", "sabela"]:
+    raise ValueError(f"Unsupported voice: {voice}")

Line range hint 141-175: LGTM! Comprehensive text preprocessing with clear documentation.

The get_tts method includes thorough documentation and handles special cases in text preprocessing effectively.


189-211: LGTM! Efficient engine caching with proper initialization.

The get_engine method implements an efficient caching mechanism for inference engines, improving performance by reusing initialized engines.

ovos_tts_plugin_nos/vits_onnx.py (2)

404-575: LGTM! Robust tokenization with proper error handling.

The TTSTokenizer class implements comprehensive text tokenization with proper handling of out-of-vocabulary characters and special tokens.
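For orientation, out-of-vocabulary handling of the kind described here usually follows the pattern below; this is an illustrative sketch, not the TTSTokenizer code added in this PR:

# Illustrative character-to-id encoding with out-of-vocabulary logging.
import logging

LOG = logging.getLogger(__name__)

def text_to_ids(text: str, char_to_id: dict) -> list:
    ids = []
    for char in text:
        if char in char_to_id:
            ids.append(char_to_id[char])
        else:
            # skip characters the model was not trained on, but log them
            LOG.warning("Discarding out-of-vocabulary character: %r", char)
    return ids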

577-731: LGTM! Well-implemented ONNX inference with proper audio handling.

The VitsOnnxInference class provides a robust implementation for ONNX-based text-to-speech synthesis with:

  • Flexible configuration management
  • Proper audio normalization
  • Efficient inference pipeline


coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
ovos_tts_plugin_nos/vits_onnx.py (1)

711-713: Consider making normalization threshold configurable.

The audio normalization uses a hardcoded threshold of 0.01, which might not be optimal for all cases.

Apply this diff to make the threshold configurable:

+    @staticmethod
+    def save_wav(wav: np.ndarray, path: str, sample_rate: int = 16000, norm_threshold: float = 0.01) -> None:
-        wav_norm = wav * (32767 / max(0.01, np.max(np.abs(wav))))
+        wav_norm = wav * (32767 / max(norm_threshold, np.max(np.abs(wav))))
         wav_norm = wav_norm.astype(np.int16)
         scipy.io.wavfile.write(path, sample_rate, wav_norm)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f7f5bdc and 3ed9097.

📒 Files selected for processing (1)
  • ovos_tts_plugin_nos/vits_onnx.py (1 hunks)
🔇 Additional comments (3)
ovos_tts_plugin_nos/vits_onnx.py (3)

1-11: LGTM! Well-structured imports and constants.

The code properly imports required dependencies and uses the project's logging utility.


404-575: LGTM! Well-implemented tokenization with proper error handling.

The TTSTokenizer class provides robust text tokenization with configurable features for blank token insertion and sequence padding. The implementation properly handles out-of-vocabulary characters with appropriate logging.


338-360: ⚠️ Potential issue

Fix inconsistency between implementation and documentation.

The docstring states that tokens should be ordered as [PAD, EOS, BOS, BLANK, CHARACTERS, PUNCTUATIONS], but the implementation uses a different order: [PAD, PUNCTUATIONS, CHARACTERS, BLANK]. Additionally, EOS and BOS tokens are missing from the implementation.

Apply this diff to fix the implementation:

-        self._vocab = [self._pad] + list(self._punctuations) + list(self._characters) + [self._blank]
+        self._vocab = [self._pad, self._eos, self._bos, self._blank] + list(self._characters) + list(self._punctuations)

Likely invalid or redundant comment.

JarbasAl added the help wanted label Jan 19, 2025
JarbasAl marked this pull request as draft January 19, 2025 07:43
github-actions bot added and removed the feature label Jan 19, 2025
Labels: feature, help wanted