Formal multi-dictionary and cross-lingual support are ready for public testing #203
-
I have some questions about this: 2 - With the previous method, using a unified dictionary to train multi-language models, one speaker could learn other languages even without samples of said language. Can the same thing be replicated with this method? If one speaker doesn't have French samples, for example, can the model still learn how to sing in French using the speaker's timbre and the bulk data?
-
I also have a question: Currently, I have my dictionaries set up in such a way that, for example, Japanese and English can be used on the same track. This is convenient for some songs that fluidly go from one language to the other, sometimes in the same sentence without any pauses. Will this still be possible with this system? One thing I always liked about DiffSinger is the fact that this is very convenient to do, and I'd be rather sad if it were to become impossible.
-
I'm really excited to implement this into my multilingual database! Question, though: could I have 2 separate databases with the same spk_id? For example, if one speaker embed has 2 languages recorded, could I do one database in …
-
Could we use this feature as a way to do different accents/dialects of a language?
-
We have already seen many multi-lingual voicebanks in the community. Most of them use separate phoneme tags with caps, suffixes or numbers to avoid conflicts, which makes the labels chaotic and hurts the user experience. It would be great if we could formally support combining different dictionaries and languages in a graceful way. With the new feature in this repository, you can use the most understandable and commonly accepted labels for each language. We hope this can unify the ecosystem of DiffSinger labeling and voicebank making.
The new feature will be temporarily kept on the `multi-dict` branch. Please note that the following things are under development and testing, and may change (or even break) before they are merged into the main branch.

## Principles

### Language-specific phonemes
If there are multiple dictionaries (languages), all language-specific phonemes will be prefixed with their language name, for example `zh/a`, `ja/o`, `en/eh`. This step is done automatically by the binarizer, so you no longer need to distinguish phonemes from different languages by yourself. As a cost, the slash `/` becomes a reserved character and should not be used in phoneme names.

We highly recommend using ISO 639 language codes as language tags. For example, `zh` and `zho` stand for Chinese (`cmn` specifically for Mandarin Chinese), `ja` and `jpn` for Japanese, `en` and `eng` for English, and `yue` for Cantonese (Yue). You can download a complete language code table from https://iso639-3.sil.org/code_tables/download_tables.
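Roughly speaking, the prefixing rule works like the sketch below. This is purely illustrative (not the actual binarizer code), and global phonemes, described in the next section, are never prefixed and are left out here.

```python
# Illustrative helper only; the binarizer performs this step automatically.
def qualify(phoneme: str, lang: str) -> str:
    """Return the language-qualified name, e.g. ('a', 'zh') -> 'zh/a'."""
    if "/" in phoneme:
        # '/' is reserved as the separator, so such a name is already
        # fully qualified (e.g. 'en/eh') and must not be prefixed again.
        return phoneme
    return f"{lang}/{phoneme}"

assert qualify("a", "zh") == "zh/a"
assert qualify("o", "ja") == "ja/o"
assert qualify("en/eh", "zh") == "en/eh"
```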
### Global phonemes

Some phonemes do not belong to any language, including the reserved tags (`AP` and `SP`) and other user-defined tags (`EP`, `VF`, `Edge`, `GlottalStop`, etc.). These tags are not prefixed with a language, and they take priority when identifying phoneme names. Language-specific phonemes should not have the same names as global phonemes; otherwise they will be regarded as global phonemes.
### Phoneme identification

There are two ways to identify a phoneme:

1. By full name: `zh/a` for the phoneme `a` of the Chinese language, `AP` for the global phoneme of breath.
2. By short name under a language context (for example, when the main language of the dataset is `zh`): `a` actually equals `zh/a`, while `AP` is still `AP` (global phonemes take priority).
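A minimal sketch of these two lookup rules, assuming an illustrative set of global phonemes (the function and names below are not the project's actual API):

```python
from typing import Optional

GLOBAL_PHONEMES = {"AP", "SP", "EP", "VF"}   # illustrative set; globals are checked first

def identify(name: str, context_lang: Optional[str] = None) -> str:
    """Resolve a label to its full phoneme name under an optional language context."""
    if name in GLOBAL_PHONEMES:
        return name                        # global phonemes take priority
    if "/" in name:
        return name                        # already a full name, e.g. 'zh/a'
    if context_lang is None:
        raise ValueError(f"short name {name!r} requires a language context")
    return f"{context_lang}/{name}"        # short name resolved by the context

assert identify("zh/a") == "zh/a"                     # full name
assert identify("a", context_lang="zh") == "zh/a"     # short name in a zh context
assert identify("AP", context_lang="zh") == "AP"      # globals are never prefixed
```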
### Phoneme merging and language embedding

The phoneme set expands rapidly with the number of languages, and there are actually many similar phonemes that can be merged. However, different people have different ideas about phoneme merging, and adjusting the labels is very inconvenient. For this reason, we now support flexible phoneme merging groups that everyone can define without pain. For example, if we merge the three phonemes `zh/i`, `ja/i` and `en/iy`, they all keep their own names, but they will be mapped to the same phoneme ID before being sent into the model. Global phonemes are never merged.

Directly merging some phonemes is probably one step too far: previous experiments have shown accent leaks after doing this. To solve this issue, we implemented language embedding to distinguish the same phoneme across different languages. All cross-lingual phonemes (phonemes from two or more languages that are merged together) are tagged with their actual language, and all cross-lingual phonemes from one language share the same embedding vector. In this way, the merged phonemes become closer to (but not completely the same as) each other.
Internal experiments have shown that merging phonemes and language embedding can help reduce timbre and accent leaks in cross-lingual (out-of-domain) singing compared to the baseline.
## Dictionaries
The dictionary format is not changed, except that you may rename your phonemes to drop the tricks previously used to avoid phoneme name conflicts. Global phonemes can also appear in the dictionary; all other phonemes will be treated as language-specific phonemes.
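For instance, a Chinese dictionary and an English dictionary can both keep their natural labels side by side. The excerpts below are hypothetical and only assume the usual one-entry-per-line word-to-phonemes layout your current dictionaries already use. A Chinese excerpt:

```
ni      n i
hao     h ao
```

and an English excerpt:

```
hello   hh ah l ow
sing    s ih ng
```

Suffixes, caps or renamed tags are no longer needed for the overlapping labels, because the binarizer qualifies them per language internally (e.g. `zh/i`, `en/ih`).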
## Datasets and labels
Each dataset should have a main language. If you have many recordings in multiple languages, it is recommended to separate them by language (you can merge their speaker IDs afterwards). In each dataset, the main language is set as the language context, and phoneme labels in `transcriptions.csv` do not need a prefix (short names). It is also valid to include phonemes from other languages, but all of them must be prefixed with their actual language (full names). Global phonemes should not be prefixed in any dataset.
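For example, in a dataset whose main language is `zh`, a phoneme sequence in `transcriptions.csv` might look like this (an illustrative `ph_seq` excerpt only; the other columns are unchanged):

```
SP n i h ao en/hh en/ah en/l en/ow AP SP
```

The `zh` phonemes use short names, the English phonemes carry the `en/` prefix, and `SP`/`AP` stay unprefixed.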
## Configuration
Below is an example configuration for multi-dictionary models:
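The key names in this sketch (`dictionaries`, `extra_phonemes`, `merged_phoneme_groups`, `use_lang_id`) and the paths are illustrative assumptions rather than the exact schema; please refer to the config templates on the `multi-dict` branch for the authoritative keys.

```yaml
dictionaries:                      # one dictionary per language tag
  zh: dictionaries/mandarin.txt
  ja: dictionaries/japanese.txt
  en: dictionaries/english.txt
extra_phonemes: [EP, VF]           # user-defined global phonemes besides AP/SP
merged_phoneme_groups:             # names in one group share a phoneme ID
  - [zh/i, ja/i, en/iy]
  - [zh/a, ja/a]
use_lang_id: true                  # enable language embedding
```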
## Preprocessing, training and inference
In preprocessing, you only need to cover all phoneme IDs instead of all phoneme names to pass the coverage checks.
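In other words, a name that never occurs in your data is fine as long as some other name mapped to the same ID does. A toy illustration (not the actual check in the binarizer):

```python
phoneme_id = {"AP": 0, "SP": 1, "zh/i": 2, "ja/i": 2, "en/iy": 2, "zh/a": 3}  # illustrative mapping
observed = {"AP", "SP", "zh/i", "zh/a"}            # names actually seen in the labels

covered_ids = {phoneme_id[p] for p in observed}
missing_ids = set(phoneme_id.values()) - covered_ids
assert not missing_ids   # ID 2 is covered through 'zh/i' even though 'en/iy' never appears
```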
The training API is completely the same as before.
The inference API has a new option `--lang` for multi-lingual models. This is used to define the language context for all the input segments. Without this option, a `lang` key in each segment can also define the language context. The rules for phoneme name sequences are the same as for dataset transcriptions.
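For illustration, assuming the usual inference entry point (the script path and experiment name below are hypothetical):

```bash
python scripts/infer.py acoustic my_song.ds --exp my_multilingual_exp --lang zh
```

Alternatively, a segment inside the input file can carry its own context via something like `"lang": "ja"`, overriding nothing else.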
## Deployment

The ONNX exporters have been adapted to multi-lingual models. The only difference in ONNX is that models with language embedding accept a new `languages` input besides `tokens`. Meanwhile, the language and phoneme mappings (name -> ID) are now given by JSON files instead of a line-by-line `phonemes.txt`.
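A minimal sketch of consuming such a model with onnxruntime; the file name and the set of other inputs are assumptions (a real acoustic model needs more inputs than shown here):

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("acoustic.onnx")          # exported model (name assumed)
print([i.name for i in session.get_inputs()])            # 'languages' should appear next to 'tokens'

tokens = np.array([[0, 12, 34, 56, 0]], dtype=np.int64)      # phoneme IDs from the JSON mapping (illustrative)
languages = np.array([[0, 1, 1, 2, 0]], dtype=np.int64)      # one language ID per token (illustrative)
# session.run(None, {"tokens": tokens, "languages": languages, ...})  # plus the model's other inputs
```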
OpenUTAU does not support this new feature currently. We will keep in contact with its developers on how to implement multi-lingual support with a pleasant user experience.
Questions? Feedback? Thoughts? Experimental results? Dictionary proposals? Please leave comments below and discuss with the community.