#! https://zhuanlan.zhihu.com/p/611737295
Universal Speaker Extraction in the Presence and Absence of Target Speakers for Speech of One and Two Talkers (Reading Notes)
Traditional speaker extraction models fail in scenarios where the target speaker is absent from the mixture.
The paper proposes to handle speech mixtures with one or two talkers, in which the target speaker can be either present or absent.
Speaker extraction (SE) uses the target speaker's reference signal to extract that speaker's voice from a multi-talker speech mixture, without any prior knowledge of the number of speakers.
In the presence of the target speaker, the model extracts the target speaker's voice; in the absence of the target speaker, the model is expected to output silence. The authors introduce a joint training scheme with one unified loss function for all four conditions.
A universal speaker extraction system should perform under four conditions to cover all acoustic scenarios in everyday conversational situations: one or two talkers in the mixture (1T/2T), each combined with the target speaker being present (PT) or absent (AT).
Speaker extraction models have mostly been trained on two- or multi-talker scenarios in which the target speaker is present, denoted as 2T-PT. Such models usually fail when applied to one of the other three scenarios: they show poor performance, produce artifacts, or recover a random speaker's voice. This behavior is undesirable; the model should instead either extract the target speaker's voice (PT) or output silence (AT).
The paper proposes a joint training scheme with a unified loss function, since a single system is built for all four conditions. Commonly, a small stabilization value is added to such a loss so that it remains well defined when the target is silence (the absent-target conditions).
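To make the idea concrete, here is a minimal PyTorch sketch of such a unified loss (my own reconstruction, not necessarily the paper's exact formulation; the function name and `eps` value are assumptions): a negative SI-SDR loss with a small stabilization value in the numerator and denominator, so the loss stays finite when the reference target is all zeros and minimizing it drives the output energy toward silence.

```python
import torch

def unified_si_sdr_loss(est, target, eps=1e-8):
    """Hypothetical unified loss: negative SI-SDR with a small stabilization
    value eps so the loss stays finite when the target is all zeros
    (absent-target conditions). In that case, minimizing the loss reduces
    the energy of the estimate, i.e. pushes the output toward silence."""
    # Zero-mean both signals along the time axis.
    est = est - est.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    # Scale-invariant projection of the estimate onto the target.
    dot = (est * target).sum(dim=-1, keepdim=True)
    s_target = dot * target / (target.pow(2).sum(dim=-1, keepdim=True) + eps)
    e_noise = est - s_target
    # eps in numerator and denominator keeps the ratio defined for a silent target.
    si_sdr = 10 * torch.log10(
        (s_target.pow(2).sum(dim=-1) + eps) / (e_noise.pow(2).sum(dim=-1) + eps)
    )
    return -si_sdr.mean()

# Usage: a present-target batch and an absent-target (silent) batch.
est = torch.randn(4, 16000)
present = torch.randn(4, 16000)
silent = torch.zeros(4, 16000)
print(unified_si_sdr_loss(est, present), unified_si_sdr_loss(est, silent))
```

With a silent target, the projected target term vanishes and the loss effectively penalizes the output energy, which is the desired behavior for the AT conditions.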
Experiments are conducted on different training schemes and database compositions, while maintaining the same network architecture for fair comparison.