In the original implementation of this model, the authors employed a one-hot audio vector of dimension 1024. Unfortunately, the paper says little about this one-hot vector and does not explain its purpose in the model. Given that its dimension is 1024 = 2^10, and that the authors use 10-bit audio samples, I assume this vector is related to the prediction of each bit in each audio sample. But that's just a guess.
So, I have two (actually three) questions:
What is the purpose of the one-hot audio vector in the original implementation?
Why did you replace the one-hot vector with an embedding layer? What changed in the model behavior with this replacement?
Thank you very much
Yeah, the paper is very vague about the model details. You're correct that the one-hot representation is related to the 10-bit audio. Basically they apply mu-law companding to the original 16-bit audio. Then you form a one-hot representation for each sample where the 1 is at the index given by the mu-law companding. This is then fed into the autoregressive part of the model.
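To make that concrete, here is a minimal numpy sketch of the pipeline described above: mu-law compand audio, quantize to 10 bits (1024 levels), and form a one-hot vector per sample. The function names and the `bits` parameter are illustrative, not the original repo's API.

```python
import numpy as np

def mu_law_encode(audio, bits=10):
    """Mu-law compand float audio in [-1, 1] to integer class indices.

    bits=10 gives 2**10 = 1024 classes, matching the one-hot dimension
    discussed above. (Names here are illustrative, not the repo's API.)
    """
    mu = 2 ** bits - 1
    companded = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    # Shift from [-1, 1] to integer indices in [0, 2**bits - 1].
    return ((companded + 1) / 2 * mu + 0.5).astype(np.int64)

def one_hot(indices, num_classes=1024):
    """One-hot vectors fed to the autoregressive part of the model."""
    out = np.zeros((len(indices), num_classes), dtype=np.float32)
    out[np.arange(len(indices)), indices] = 1.0
    return out

samples = np.array([-1.0, 0.0, 0.5, 1.0])
idx = mu_law_encode(samples)   # indices in [0, 1023]
vecs = one_hot(idx)            # shape (4, 1024), a single 1 per row
```

Each row of `vecs` has exactly one nonzero entry, at the index given by the mu-law companding of that sample.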
I used an embedding layer just to make the model a bit more efficient. The first operation in a GRU is a matrix multiplication with the input, so a one-hot input simply picks out one column of that matrix (which is exactly what an embedding layer does). I just separated out the embedding operation and used a smaller dimension, which hopefully sped up training a little. It should work fine if you go with the original approach, though.
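The equivalence above is easy to verify numerically: multiplying a one-hot vector by a weight matrix selects the same values as directly indexing that matrix, which is all an embedding lookup does. A small numpy sketch (sizes are illustrative, not the repo's actual dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, hidden = 1024, 256  # illustrative sizes, not the repo's
W = rng.standard_normal((num_classes, hidden)).astype(np.float32)

idx = np.array([3, 700])  # two quantized sample indices

# One-hot route: build the sparse vector, then multiply -- this is what
# the GRU's input matmul effectively computes with a one-hot input.
onehot = np.zeros((len(idx), num_classes), dtype=np.float32)
onehot[np.arange(len(idx)), idx] = 1.0
via_matmul = onehot @ W

# Embedding route: just gather the rows of W, no matmul needed.
via_lookup = W[idx]

assert np.allclose(via_matmul, via_lookup)
```

The lookup skips the full 1024-wide multiplication, and a separate embedding also lets you choose a dimension smaller than the GRU's input matrix would otherwise force, which is the efficiency gain mentioned above.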