In the original implementation of this model, the authors employed a one-hot audio vector of dimension 1024. Unfortunately, the paper says little about this one-hot vector and does not explain its purpose in the model. Given that its dimension is 1024 = 2^10, and that the authors use 10-bit audio samples, I assume this vector is related to the prediction of each bit in each audio sample. But that's just a guess.
So, I have two (actually three) questions:
What is the purpose of the one-hot audio vector in the original implementation?
Why did you replace the one-hot vector with an embedding layer? What changed in the model behavior with this replacement?
Thank you very much
Yeah, the paper is very vague about the model details. You're correct that the one-hot representation is related to the 10-bit audio. Basically they apply mu-law companding to the original 16-bit audio. Then you form a one-hot representation for each sample where the 1 is at the index given by the mu-law companding. This is then fed into the autoregressive part of the model.
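To make that concrete, here is a minimal numpy sketch of the pipeline described above: mu-law compand audio, quantize to 10 bits (1024 levels), and form a one-hot vector per sample. The function names and the `bits` parameter are illustrative, not the original repo's API.

```python
import numpy as np

def mu_law_encode(audio, bits=10):
    """Mu-law compand float audio in [-1, 1] to integer class indices.

    bits=10 gives 2**10 = 1024 classes, matching the one-hot dimension
    discussed above. (Names here are illustrative, not the repo's API.)
    """
    mu = 2 ** bits - 1
    companded = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    # Shift from [-1, 1] to integer indices in [0, 2**bits - 1].
    return ((companded + 1) / 2 * mu + 0.5).astype(np.int64)

def one_hot(indices, num_classes=1024):
    """One-hot vectors fed to the autoregressive part of the model."""
    out = np.zeros((len(indices), num_classes), dtype=np.float32)
    out[np.arange(len(indices)), indices] = 1.0
    return out

samples = np.array([-1.0, 0.0, 0.5, 1.0])
idx = mu_law_encode(samples)   # indices in [0, 1023]
vecs = one_hot(idx)            # shape (4, 1024), a single 1 per row
```

Each row of `vecs` has exactly one nonzero entry, at the index given by the mu-law companding of that sample.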
I used an embedding layer just to make the model a bit more efficient. The first operation in a GRU is a matrix multiplication with the input, so a one-hot input simply picks out one column of that matrix (which is exactly what an embedding layer does). I just separated out the embedding operation and used a smaller dimension, which hopefully sped up training a little. It should work fine if you go with the original approach, though.
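The equivalence above is easy to verify numerically: multiplying a one-hot vector by a weight matrix selects the same values as directly indexing that matrix, which is all an embedding lookup does. A small numpy sketch (sizes are illustrative, not the repo's actual dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, hidden = 1024, 256  # illustrative sizes, not the repo's
W = rng.standard_normal((num_classes, hidden)).astype(np.float32)

idx = np.array([3, 700])  # two quantized sample indices

# One-hot route: build the sparse vector, then multiply -- this is what
# the GRU's input matmul effectively computes with a one-hot input.
onehot = np.zeros((len(idx), num_classes), dtype=np.float32)
onehot[np.arange(len(idx)), idx] = 1.0
via_matmul = onehot @ W

# Embedding route: just gather the rows of W, no matmul needed.
via_lookup = W[idx]

assert np.allclose(via_matmul, via_lookup)
```

The lookup skips the full 1024-wide multiplication, and a separate embedding also lets you choose a dimension smaller than the GRU's input matrix would otherwise force, which is the efficiency gain mentioned above.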