
Can Wav2vecDS be made more general? #22

Open

tailangjun opened this issue Apr 2, 2024 · 4 comments

Comments

@tailangjun

I noticed that the audio features output by torchaudio.pipelines.HUBERT_ASR_LARGE have shape (m, 29), while those output by DeepSpeech v0.1 have shape (n, 29), with m and n fairly close. Wav2vecDS maps the (m, 29) features to (n, 29).

I'm wondering whether Wav2vecDS could be made more general, so that features of any dimension could be mapped to (n, 29), similar to AudioNet in ER-NeRF/nerf_triplane/network.py. That way one could freely pick a model with Chinese support to extract speech features, e.g. chinese-wav2vec2-large or chinese-hubert-large.
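For illustration, here is a rough sketch of the kind of generalized mapper I have in mind. This is just my own sketch, not the repo's actual Wav2vecDS code; the class name, layer sizes, and the in_dim=1024 example are assumptions:

```python
import torch
import torch.nn as nn

class GenericAudioMapper(nn.Module):
    """Frame-wise map from (T, in_dim) encoder features to (T, 29) DeepSpeech-style features."""

    def __init__(self, in_dim: int, hidden_dim: int = 256, out_dim: int = 29):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, in_dim), e.g. in_dim=1024 for chinese-hubert-large hidden states
        return self.net(x)

# Example: mapper = GenericAudioMapper(in_dim=1024)
```

Note that this maps only the feature dimension; since m and n frame counts still differ slightly, the sequences would also need temporal resampling (e.g. torch.nn.functional.interpolate) to align.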

@Elsaam2y
Owner

Yes, thanks for the recommendation. I tried doing so, mainly to support Chinese; however, the mapping became more complex and the output features weren't always convincing, as seen in the resulting lip-sync.

@Elsaam2y
Owner

But please feel free to open a PR if you work on this and manage to get better results.

@tailangjun
Author

> Yes, thanks for the recommendation. I tried doing so, mainly to support Chinese; however, the mapping became more complex and the output features weren't always convincing, as seen in the resulting lip-sync.

Got it, thanks.

@PengYicong

I see that the mapping network Wav2vecDS is a simple MLP. Perhaps mapping complex features to the DeepSpeech output calls for a more modern structure such as a transformer? I'm curious whether you could share some training details regarding the datasets and the SyncNet configuration; I can try to train such a network in my spare time.
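To make the idea concrete, here is a rough sketch of the transformer-style mapper I have in mind. It is purely illustrative; the class name and all hyperparameters are assumptions, not anything from this repo:

```python
import torch
import torch.nn as nn

class TransformerAudioMapper(nn.Module):
    """Sequence-level mapper: (B, T, in_dim) -> (B, T, out_dim) with temporal attention."""

    def __init__(self, in_dim: int = 29, d_model: int = 128, out_dim: int = 29,
                 nhead: int = 4, num_layers: int = 2):
        super().__init__()
        self.proj_in = nn.Linear(in_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.proj_out = nn.Linear(d_model, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Unlike a per-frame MLP, self-attention lets each output frame
        # draw on temporal context from the whole sequence.
        return self.proj_out(self.encoder(self.proj_in(x)))
```

The main design difference from the MLP is that each output frame can attend to neighboring frames, which may matter for lip-sync since visemes depend on phonetic context.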
