Formal multi-dictionary and cross-lingual support are ready for public testing #203
-
I have some questions about this: 2 - With the previous method, using a unified dictionary to train multi-language models, one speaker could learn other languages even without samples of said language. Can the same thing be replicated with this method? If one speaker doesn't have French samples, for example, can the model still learn how to sing in French using the speaker's timbre and the bulk data?
-
I also have a question: Currently, I have my dictionaries set up in such a way that, for example, Japanese and English can be used on the same track. This is convenient for some songs that fluidly go from one language to the other, sometimes in the same sentence without any pauses. Will this still be possible with this system? One thing I always liked about DiffSinger is the fact that this is very convenient to do, and I'd be rather sad if it were to become impossible.
-
I'm really excited to implement this into my multilingual database! Question, though: could I have 2 separate databases with the same spk_id? For example, if one speaker embed has 2 languages recorded, could I do one database in …
-
Could we use this feature as a way to do different accents/dialects of a language?
-
We have already seen many multi-lingual voicebanks in the community. Most of them use separate phoneme tags with caps, suffixes or numbers to avoid conflicts, which makes the labels chaotic and hurts the user experience. It would be great if we could formally support combining different dictionaries and languages in a graceful way. With the new feature in this repository, you can use the most understandable and commonly accepted labels for each language. We hope this can unify the ecosystem of DiffSinger labeling and voicebank making.
The new feature will be temporarily kept on the `multi-dict` branch. Please note that the following things are under development and testing, and may change (or even break) before they are merged into the main branch.

## Principles

### Language-specific phonemes
If there are multiple dictionaries (languages), all language-specific phonemes will be prefixed with their language name, for example `zh/a`, `ja/o`, `en/eh`. This step is done automatically by the binarizer, so you no longer need to distinguish phonemes from different languages by yourself. As a cost, the slash `/` becomes a reserved character and should not be used in phoneme names.

We highly recommend using ISO 639 language codes as language tags. For example, `zh` and `zho` stand for Chinese (`cmn` specifically for Mandarin Chinese), `ja` and `jpn` for Japanese, `en` and `eng` for English, and `yue` for Cantonese (Yue). You can download a complete language code table from https://iso639-3.sil.org/code_tables/download_tables.
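Roughly speaking, the prefixing rule works like the sketch below. This is purely illustrative (not the actual binarizer code), and global phonemes, described in the next section, are never prefixed and are left out here.

```python
# Illustrative helper only; the binarizer performs this step automatically.
def qualify(phoneme: str, lang: str) -> str:
    """Return the language-qualified name, e.g. ('a', 'zh') -> 'zh/a'."""
    if "/" in phoneme:
        # '/' is reserved as the separator, so such a name is already
        # fully qualified (e.g. 'en/eh') and must not be prefixed again.
        return phoneme
    return f"{lang}/{phoneme}"

assert qualify("a", "zh") == "zh/a"
assert qualify("o", "ja") == "ja/o"
assert qualify("en/eh", "zh") == "en/eh"
```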
### Global phonemes

Some phonemes do not belong to any language, including the reserved tags (`AP` and `SP`) and other user-defined tags (`EP`, `VF`, `Edge`, `GlottalStop`, etc.). These tags are not prefixed with a language, and they take priority when identifying phoneme names. Language-specific phonemes should not have the same names as global phonemes; otherwise they will be regarded as global phonemes.
### Phoneme identification

There are two ways to identify a phoneme:

1. By full name: `zh/a` for the phoneme `a` of the Chinese language, `AP` for the global phoneme of breath.
2. By short name under a language context (for example, when the main language of the dataset is `zh`): `a` actually equals `zh/a`, while `AP` is still `AP` (global phonemes take priority).
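A minimal sketch of these two lookup rules, assuming an illustrative set of global phonemes (the function and names below are not the project's actual API):

```python
from typing import Optional

GLOBAL_PHONEMES = {"AP", "SP", "EP", "VF"}   # illustrative set; globals are checked first

def identify(name: str, context_lang: Optional[str] = None) -> str:
    """Resolve a label to its full phoneme name under an optional language context."""
    if name in GLOBAL_PHONEMES:
        return name                        # global phonemes take priority
    if "/" in name:
        return name                        # already a full name, e.g. 'zh/a'
    if context_lang is None:
        raise ValueError(f"short name {name!r} requires a language context")
    return f"{context_lang}/{name}"        # short name resolved by the context

assert identify("zh/a") == "zh/a"                     # full name
assert identify("a", context_lang="zh") == "zh/a"     # short name in a zh context
assert identify("AP", context_lang="zh") == "AP"      # globals are never prefixed
```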
### Phoneme merging and language embedding

The phoneme set expands rapidly with the number of languages, and there are actually many similar phonemes that can be merged. However, different people have different ideas about phoneme merging, and adjusting the labels is very inconvenient. For this reason, we now support flexible phoneme merging groups that everyone can define without pain. For example, if we merge the three phonemes `zh/i`, `ja/i` and `en/iy`, they all keep their own names, but they will be mapped to the same phoneme ID before being sent into the model. Global phonemes are never merged.

Directly merging some phonemes is probably one step too far: previous experiments have shown accent leaks after doing this. To solve this issue, we implemented language embedding to distinguish the same phoneme across different languages. All cross-lingual phonemes (phonemes from two or more languages that are merged together) are tagged with their actual language, and all cross-lingual phonemes from one language share the same embedding vector. In this way, the merged phonemes become closer to (but not completely the same as) each other.
Internal experiments have shown that merging phonemes and language embedding can help reduce timbre and accent leaks in cross-lingual (out-of-domain) singing compared to the baseline.
## Dictionaries
The dictionary format is not changed, except that you may rename your phonemes to drop the tricks previously used to avoid phoneme name conflicts. Global phonemes can also appear in the dictionary; all other phonemes will be treated as language-specific phonemes.
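For instance, a Chinese dictionary and an English dictionary can both keep their natural labels side by side. The excerpts below are hypothetical and only assume the usual one-entry-per-line word-to-phonemes layout your current dictionaries already use. A Chinese excerpt:

```
ni      n i
hao     h ao
```

and an English excerpt:

```
hello   hh ah l ow
sing    s ih ng
```

Suffixes, caps or renamed tags are no longer needed for the overlapping labels, because the binarizer qualifies them per language internally (e.g. `zh/i`, `en/ih`).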
## Datasets and labels
Each dataset should have a main language. If you have many recordings in multiple languages, it is recommended to separate them by language (you can merge their speaker IDs afterwards). In each dataset, the main language is set as the language context, and phoneme labels in `transcriptions.csv` do not need a prefix (short names). It is also valid to include phonemes from other languages, but all of them must be prefixed with their actual language (full names). Global phonemes should not be prefixed in any dataset.
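For example, in a dataset whose main language is `zh`, a phoneme sequence in `transcriptions.csv` might look like this (an illustrative `ph_seq` excerpt only; the other columns are unchanged):

```
SP n i h ao en/hh en/ah en/l en/ow AP SP
```

The `zh` phonemes use short names, the English phonemes carry the `en/` prefix, and `SP`/`AP` stay unprefixed.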
## Configuration
Below is an example configuration for multi-dictionary models:
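The key names in this sketch (`dictionaries`, `extra_phonemes`, `merged_phoneme_groups`, `use_lang_id`) and the paths are illustrative assumptions rather than the exact schema; please refer to the config templates on the `multi-dict` branch for the authoritative keys.

```yaml
dictionaries:                      # one dictionary per language tag
  zh: dictionaries/mandarin.txt
  ja: dictionaries/japanese.txt
  en: dictionaries/english.txt
extra_phonemes: [EP, VF]           # user-defined global phonemes besides AP/SP
merged_phoneme_groups:             # names in one group share a phoneme ID
  - [zh/i, ja/i, en/iy]
  - [zh/a, ja/a]
use_lang_id: true                  # enable language embedding
```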
## Preprocessing, training and inference
In preprocessing, you only need to cover all phoneme IDs instead of all phoneme names to pass the coverage checks.
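In other words, a name that never occurs in your data is fine as long as some other name mapped to the same ID does. A toy illustration (not the actual check in the binarizer):

```python
phoneme_id = {"AP": 0, "SP": 1, "zh/i": 2, "ja/i": 2, "en/iy": 2, "zh/a": 3}  # illustrative mapping
observed = {"AP", "SP", "zh/i", "zh/a"}            # names actually seen in the labels

covered_ids = {phoneme_id[p] for p in observed}
missing_ids = set(phoneme_id.values()) - covered_ids
assert not missing_ids   # ID 2 is covered through 'zh/i' even though 'en/iy' never appears
```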
The training API is completely the same as before.
The inference API has a new option `--lang` for multi-lingual models. This is used to define the language context for all the input segments. Without this option, a `lang` key in each segment can also define the language context. The rules for phoneme name sequences are the same as for dataset transcriptions.
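For illustration, assuming the usual inference entry point (the script path and experiment name below are hypothetical):

```bash
python scripts/infer.py acoustic my_song.ds --exp my_multilingual_exp --lang zh
```

Alternatively, a segment inside the input file can carry its own context via something like `"lang": "ja"`, overriding nothing else.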
## Deployment

The ONNX exporters have been adapted to multi-lingual models. The only difference in ONNX is that models with language embedding accept a new `languages` input besides `tokens`. Meanwhile, the language and phoneme mappings (name -> ID) are now given by JSON files instead of a line-by-line `phonemes.txt`.
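A minimal sketch of consuming such a model with onnxruntime; the file name and the set of other inputs are assumptions (a real acoustic model needs more inputs than shown here):

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("acoustic.onnx")          # exported model (name assumed)
print([i.name for i in session.get_inputs()])            # 'languages' should appear next to 'tokens'

tokens = np.array([[0, 12, 34, 56, 0]], dtype=np.int64)      # phoneme IDs from the JSON mapping (illustrative)
languages = np.array([[0, 1, 1, 2, 0]], dtype=np.int64)      # one language ID per token (illustrative)
# session.run(None, {"tokens": tokens, "languages": languages, ...})  # plus the model's other inputs
```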
OpenUTAU does not support this new feature currently. We will keep in contact with its developers on how to implement multi-lingual support with a pleasant user experience.
Questions? Feedback? Thoughts? Experimental results? Dictionary proposals? Please leave comments below and discuss with the community.