#! https://zhuanlan.zhihu.com/p/610446501
VoiceFilter separates the voice of a target spk from multi-spk signals by making use of a ref signal from the target spk. This is achieved by training two separate neural networks: (1) a spk recognition network that produces spk-discriminative embeddings; (2) a spectrogram masking network that takes both the noisy spectrogram and the spk embedding as input, and produces a mask.
Spk-dependent speech separation = voice filtering = spk extraction
We first train an LSTM-based spk encoder to compute robust spk embedding vectors, then separately train a T-F (time-frequency) mask-based system that takes two inputs: (1) the embedding vector of the target spk, previously computed with the spk encoder; (2) the noisy multi-spk audio. This system is trained to remove the interfering spks and output only the voice of the target spk. The approach extends easily to more than one spk of interest: repeat the process in turn with the ref recording of each target spk, as in the sketch below.
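A minimal sketch of that loop, assuming hypothetical wrappers `compute_dvector` and `voicefilter` around the two trained networks (the names are illustrative, not the paper's API):

```python
def extract_all_targets(mixture, ref_recordings, compute_dvector, voicefilter):
    """Run spk extraction once per target spk, conditioning on that spk's
    ref recording. `compute_dvector` and `voicefilter` are assumed wrappers
    around the trained spk encoder and masking network."""
    separated = {}
    for spk_id, ref_audio in ref_recordings.items():
        d_vector = compute_dvector(ref_audio)               # spk encoder
        separated[spk_id] = voicefilter(mixture, d_vector)  # masking network
    return separated
```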
A 3-layer LSTM network produces the spk embedding (d-vector), taking log-mel filterbank energies as input.
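A minimal PyTorch sketch of such an encoder; the hidden and embedding sizes roughly follow the common GE2E d-vector setup, but treat the exact values as illustrative assumptions:

```python
import torch
import torch.nn as nn

class DVectorEncoder(nn.Module):
    """3-layer LSTM over log-mel frames -> fixed-size, L2-normalized d-vector.
    Layer sizes are illustrative, not necessarily the paper's exact config."""
    def __init__(self, n_mels=40, hidden=768, emb_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, log_mel):                      # (batch, frames, n_mels)
        out, _ = self.lstm(log_mel)
        emb = self.proj(out[:, -1])                  # hidden state of last frame
        return emb / emb.norm(dim=1, keepdim=True)   # unit-length d-vector
```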
The masking network takes two inputs: the d-vector of the target spk, and a magnitude spectrogram computed from the noisy audio. The d-vector is repeatedly concatenated to the output of the last convolutional layer at every time frame.
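A simplified PyTorch sketch of that per-frame concatenation; the single conv layer and the default dimensions are stand-ins for the paper's deeper CNN stack, not its exact configuration:

```python
import torch
import torch.nn as nn

class MaskNet(nn.Module):
    """Simplified VoiceFilter-style masking network: CNN features are
    concatenated with the (tiled) d-vector at every time frame, then an
    LSTM + FC head predicts a soft mask over the magnitude spectrogram."""
    def __init__(self, n_freq=601, emb_dim=256, hidden=400):
        super().__init__()
        self.cnn = nn.Conv2d(1, 8, kernel_size=5, padding=2)
        self.lstm = nn.LSTM(8 * n_freq + emb_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_freq)

    def forward(self, mag, dvec):                # mag: (B, T, F), dvec: (B, E)
        x = self.cnn(mag.unsqueeze(1))           # (B, 8, T, F)
        x = x.permute(0, 2, 1, 3).flatten(2)     # (B, T, 8*F)
        dvec = dvec.unsqueeze(1).expand(-1, x.size(1), -1)  # tile over frames
        x, _ = self.lstm(torch.cat([x, dvec], dim=2))
        mask = torch.sigmoid(self.fc(x))         # soft mask in [0, 1]
        return mask * mag                        # masked (enhanced) magnitude
```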
Evaluation metrics:
- WER: we want to reduce the WER in multi-spk scenarios while preserving the same WER in single-spk scenarios (i.e., the filter should not hurt already-clean speech).
- SDR: signal-to-distortion ratio, the standard measure of separation quality (a minimal computation is sketched below).
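For reference, the basic (scale-unaware) SDR in dB; toolkits such as BSS Eval compute a more involved decomposition, so take this as the simplest variant:

```python
import numpy as np

def sdr(reference, estimate, eps=1e-8):
    """Basic signal-to-distortion ratio in dB: energy of the reference
    over the energy of the residual error."""
    error = reference - estimate
    return 10 * np.log10(np.sum(reference**2) / (np.sum(error**2) + eps))
```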
To improve:
- Train the spk encoder on a larger dataset
- Add more interfering spks during training
- Compute d-vectors over several utts instead of only one to obtain more robust spk embeddings (see the sketch after this list)
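A minimal sketch of that multi-utt aggregation, assuming the `DVectorEncoder` sketched earlier; averaging unit-length d-vectors and re-normalizing is a common convention, not necessarily the paper's exact recipe:

```python
import torch

@torch.no_grad()
def aggregate_dvectors(encoder, utterances):
    """Average per-utt d-vectors and re-normalize to unit length.
    `utterances` is a list of (frames, n_mels) log-mel tensors."""
    embs = torch.stack([encoder(u.unsqueeze(0)).squeeze(0) for u in utterances])
    mean = embs.mean(dim=0)
    return mean / mean.norm()
```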