low accuracy #16
Comments
How can I improve it?
What do you mean by "using WMT16 as training data"? Last time I ran the unsupervised command from this repo on the FastText embeddings for EN and DE, it worked well: somewhat over 50% accuracy on the MUSE EN-DE bilingual dictionary.
I mean that I used the WMT16 corpus to train word2vec. I have since found my bug and now get 40% accuracy on the MUSE EN-DE test dictionary. What is your training corpus? Is FastText better than word2vec?
I didn't train my own embeddings. I used the pre-trained FastText embeddings.
Word2vec embeddings are purely co-occurrence-based, whereas fastText embeddings additionally take character information into account.
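To illustrate what "character information" means here, below is a rough sketch of the character n-gram extraction that fastText-style models build on. The function name is hypothetical, the default lengths of 3 to 6 mirror what I understand to be fastText's defaults, and the real implementation additionally hashes n-grams into buckets, which this sketch leaves out.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"  # "<" and ">" mark the start and end of the word
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


# Related word forms share many n-grams, so their vectors share parameters.
shared = set(char_ngrams("walk")) & set(char_ngrams("walked"))
print(sorted(shared))
```

A purely co-occurrence-based model treats "walk" and "walked" as unrelated symbols, whereas the shared n-grams tie them together, which can help with rare and morphologically rich words.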
@artetxem I've tried a similar approach with ELMo word embeddings. I have two almost identical English vocab files, and I extracted embeddings for both of them using ELMo. I just wanted to try out this library and see how it finds matches between these two almost identical files, as follows:
python3 map_embeddings.py --identical SRC.EMB SEMI-SRC.EMB SRC_MAPPED.EMB TRG_MAPPED.EMB
I then tried to find the similarities of a few simple English words (was, she, is, the) using the shared embeddings, but the accuracy came out as 0.0%!
Another question: why does the mapped target embedding file contain the same words as the SRC.EMB file? I'm not sure how we could use TRG_MAPPED.EMB for, say, a Dutch text if it contains the same (English) words as SRC.EMB. I think I'm missing something here.
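One thing that may be worth checking for a run like the one above is how many words the two vocabularies actually share. The sketch below assumes that the `--identical` mode seeds the mapping with identically spelled words, and that the embedding files are in word2vec text format (a "count dim" header line followed by one word per line). Both helper names are hypothetical, not functions from this repo.

```python
def read_vocab(lines):
    """Extract the words from word2vec-style text lines, skipping the header."""
    it = iter(lines)
    next(it, None)  # assumed header line: "<vocab size> <dimensions>"
    return [line.split(" ", 1)[0] for line in it if line.strip()]


def identical_seed(src_words, trg_words):
    """Words spelled identically in both vocabularies, in source order."""
    trg = set(trg_words)
    return [w for w in src_words if w in trg]


# Toy stand-ins for the two embedding files (invented values).
src_lines = ["3 2", "the 0.1 0.2", "she 0.3 0.4", "was 0.5 0.6"]
trg_lines = ["2 2", "she 0.2 0.1", "is 0.4 0.3"]
seed = identical_seed(read_vocab(src_lines), read_vocab(trg_lines))
print(len(seed), seed)
```

With real files, the same functions can be fed `open(path, encoding="utf-8")` instead of a list of strings.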
I get only 10% accuracy on EN-DE using WMT16 as training data.
The identical and unsupervised methods do not differ much.