Meta recently published their Voicebox paper: voicebox-generative-ai-model-speech
I'm copying in some details from the Appendix for discussion about the way they marked up their phones: they added ghost silences and modified the phones to carry "word positions", which feels like a sensible idea. I'm not sure how they implemented the phone masking, but infill training seems quite a novel idea, and I wonder if any of this could feed into DiffSinger training?
A.2 Phone representation
Ghost silence
The frame-level phonetic transcript used for training is obtained by force-aligning the speech with its phonetic transcript. In particular, a forced aligner may align some frames to a special phone "SIL" for non-speech frames (silence or noise). For most forced aligners, only frames between words and frames at the beginning and end of an utterance can be aligned to SIL. During inference, we are only given the text transcript, which does not tell us where silence should be inserted. Hence, it is desirable to have the duration model not only predict the duration of each phone (SIL included), but also predict the existence of SIL at eligible locations (between words and at the two ends of the utterance). To tackle this, we introduce ghost silences into the phonetic transcript: silences between words with a duration of zero frames.
To give an example, suppose the transcript contains three words: "Hey what’s up" with pronunciation "{Hey:[A,B], what’s:[C], up:[D,E,F]}", and the frame-level phonetic transcript z obtained through forced alignment is z = (SIL A B B SIL C D D D E E F SIL SIL).
The phonetic transcript becomes
y = (SIL A B SIL C SIL D E F SIL)
where the ghost silence is the SIL between C and D (shown in bold in the paper). The corresponding duration would be
l = (1, 1, 2, 1, 1, 0, 3, 2, 1, 2).
A ghost silence is inserted between "what's" and "up" during training, and the duration model should predict its duration as zero to indicate that there should be no pause between the two words.
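To make the ghost-silence bookkeeping concrete, here's a minimal sketch of how I imagine reconstructing y and l from a forced alignment. This is my own reading of the appendix, not the paper's code, and the function and variable names are made up:

```python
import itertools

def collapse(z):
    """Run-length encode a frame-level phone sequence into (phone, n_frames) pairs."""
    return [(p, len(list(g))) for p, g in itertools.groupby(z)]

def add_ghost_silences(words, pron, z):
    """Build phone sequence y with a SIL slot at both utterance ends and between
    every pair of words, plus duration sequence l. Eligible SIL slots that the
    forced alignment did not use become ghost silences with duration 0."""
    runs = collapse(z)
    y, l = [], []
    i = 0

    def take_sil():
        nonlocal i
        if i < len(runs) and runs[i][0] == "SIL":
            y.append("SIL"); l.append(runs[i][1]); i += 1
        else:  # ghost silence: an eligible slot the aligner left empty
            y.append("SIL"); l.append(0)

    take_sil()  # slot at the utterance start
    for w in words:
        for ph in pron[w]:  # consume this word's phones in alignment order
            assert runs[i][0] == ph, f"alignment mismatch at {ph}"
            y.append(ph); l.append(runs[i][1]); i += 1
        take_sil()  # slot after every word (the last one is the utterance end)
    return y, l

pron = {"Hey": ["A", "B"], "what's": ["C"], "up": ["D", "E", "F"]}
z = "SIL A B B SIL C D D D E E F SIL SIL".split()
y, l = add_ghost_silences(["Hey", "what's", "up"], pron, z)
print(y)  # ['SIL', 'A', 'B', 'SIL', 'C', 'SIL', 'D', 'E', 'F', 'SIL']
print(l)  # [1, 1, 2, 1, 1, 0, 3, 2, 1, 2]
```

Run on the paper's example this reproduces their y and l exactly, with the ghost silence showing up as the zero in l.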
Word-position-dependent phone
The possible absence of silence between words in the frame-level phone transcript can make it hard for the audio model to identify word boundaries. To help the audio model identify word boundaries, which is important when reading a sentence, we introduce word-position-dependent phones, which are commonly used in Hidden Markov Model based acoustic models for speech recognition [Povey et al., 2011]. This adds a postfix to each phone in the transcript to denote its position within the corresponding word. There are four postfixes: _B for beginning, _E for end, _I for intermediate, and _S for singleton.
The above example becomes
“{Hey:[A_B,B_E], what’s:[C_S], up:[D_B,E_I,F_E]}” with frame-level phonetic transcript
z = (SIL A_B B_E B_E SIL C_S D_B D_B D_B E_I E_I F_E SIL SIL).
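The postfix tagging itself looks mechanical; here's a quick sketch of how I'd implement the scheme (again my own guess, not the paper's code):

```python
def tag_word_positions(pron):
    """Append a word-position postfix to each phone: _B (beginning),
    _I (intermediate), _E (end), _S (singleton)."""
    tagged = {}
    for word, phones in pron.items():
        if len(phones) == 1:
            tagged[word] = [phones[0] + "_S"]
        else:
            tagged[word] = ([phones[0] + "_B"]
                            + [p + "_I" for p in phones[1:-1]]
                            + [phones[-1] + "_E"])
    return tagged

pron = {"Hey": ["A", "B"], "what's": ["C"], "up": ["D", "E", "F"]}
print(tag_word_positions(pron))
# {'Hey': ['A_B', 'B_E'], "what's": ['C_S'], 'up': ['D_B', 'E_I', 'F_E']}
```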
Phone-level mask
In terms of masking, given durations l, the relationship between the phone-level mask m′ and the frame-level mask m can be written as m = rep(m′, l). For applications where a duration model is involved (zero-shot TTS, content editing, diverse speech sampling), the frame-level mask m is extended such that no phone is partially masked. In other words, all the frames corresponding to a phone are either entirely masked or entirely unmasked.
During training, we mask a contiguous chunk of audio; infilling one such chunk is a more challenging task than infilling multiple smaller segments. All frames that are aligned to a phone are either entirely masked or entirely unmasked. Note that masking all frames of a phone is not a necessity but was chosen for ease of implementation.
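The m = rep(m′, l) relation and the "no partially masked phones" extension are easy to express with numpy. A rough sketch under my reading of the text (the names are mine, not the paper's):

```python
import numpy as np

def rep(m_phone, l):
    """m = rep(m', l): repeat each phone's mask bit for its duration in frames."""
    return np.repeat(np.asarray(m_phone, dtype=bool), l)

def extend_to_whole_phones(m_frame, l):
    """Extend a frame-level mask so no phone is partially masked:
    a phone becomes masked if any of its frames is masked."""
    bounds = np.cumsum(np.concatenate([[0], l]))  # frame index where each phone starts/ends
    m_phone = np.array([m_frame[s:e].any() for s, e in zip(bounds[:-1], bounds[1:])])
    return m_phone, rep(m_phone, l)

l = np.array([1, 1, 2, 1, 1, 0, 3, 2, 1, 2])  # durations from the example above
m_frame = np.zeros(l.sum(), dtype=bool)
m_frame[5:8] = True                            # mask a contiguous chunk of frames
m_phone, m = extend_to_whole_phones(m_frame, l)
print(m_phone.astype(int))  # [0 0 0 0 1 0 1 0 0 0] -> C and D masked
print(m.astype(int))        # frames 5..8 masked: D's third frame got pulled in
```

One nice property: a ghost silence has zero frames, so its (empty) slice never triggers masking, and np.repeat simply skips it when expanding back to frame level.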