
low accuracy #16

Open · 15091444119 opened this issue Jul 30, 2018 · 6 comments

Comments

@15091444119

I get only 10% accuracy on EN-DE when using WMT16 as the training data.
The identical and unsupervised methods do not differ much.

@15091444119 (Author)

How can I improve it?

@zhangxiangnick

What do you mean by "using WMT16 as training data"?

I tried the unsupervised command in this repo on FastText embeddings for EN and DE a while ago, and it worked well: at least somewhat over 50% accuracy on the MUSE EN-DE bilingual dictionary.
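
For reference, the unsupervised mapping command from the README looks roughly like this (EN.EMB and DE.EMB are placeholder file names for the monolingual embeddings, not the exact paths I used):

python3 map_embeddings.py --unsupervised EN.EMB DE.EMB EN_MAPPED.EMB DE_MAPPED.EMB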

@15091444119 (Author)

I mean I used the WMT16 corpus to train word2vec.

I have found my bug and now get 40% accuracy on the MUSE EN-DE test dictionary. What's your training corpus? Is FastText better than word2vec?
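
In case it helps others, here is a minimal sketch of how one could train word2vec with gensim on a tokenized monolingual corpus and save it in the plain-text word2vec format that map_embeddings.py reads (the corpus path and hyperparameters below are placeholders, not my exact settings):

# Minimal word2vec training sketch (gensim); paths and hyperparameters are placeholders.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# One sentence per line, whitespace-tokenized.
sentences = LineSentence('wmt16.tok.en')

model = Word2Vec(
    sentences,
    size=300,      # embedding dimension ("vector_size" in newer gensim versions)
    window=5,
    min_count=5,
    sg=1,          # skip-gram
    workers=4,
)

# Plain-text word2vec format: first line is "<vocab_size> <dim>", then one word per line.
model.wv.save_word2vec_format('en.emb.txt', binary=False)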

@zhangxiangnick

I didn't train my own embeddings. I used FastText pre-trained embeddings.

@hassyGo

hassyGo commented Aug 16, 2018

Word2vec embeddings are purely co-occurrence-based, whereas fastText embeddings additionally take character-level information into account.
Therefore it is hard to compare them directly in a general setting.
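
As a rough illustration (just a sketch, not code from this repo): in gensim the difference shows up in the model class and the character n-gram range, e.g.

# fastText builds vectors from character n-grams (min_n..max_n) in addition to whole words,
# so it can produce vectors for out-of-vocabulary words, which plain word2vec cannot.
from gensim.models import FastText
from gensim.models.word2vec import LineSentence

sentences = LineSentence('wmt16.tok.en')   # placeholder corpus path
model = FastText(sentences, size=300, sg=1, min_count=5, min_n=3, max_n=6)
model.wv.save_word2vec_format('en.fasttext.emb.txt', binary=False)   # saves full-word vectors only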

@yaserkl

yaserkl commented Aug 31, 2018

@artetxem I've tried a similar approach using ELMo word embeddings. I have two almost identical vocab files in English, for which I extracted embeddings using ELMo. I just wanted to try out this library and see how it finds matches between these two almost identical files, as follows:

python3 map_embeddings.py --identical SRC.EMB SEMI-SRC.EMB SRC_MAPPED.EMB TRG_MAPPED.EMB

I then tried to find the similarities of a few simple English words (was, she, is, the) using the shared embeddings with this command:
python3 eval_translation.py SRC_MAPPED.EMB TRG_MAPPED.EMB -d TEST.DICT

but the accuracy was 0.0% for me!!

Also, another question: why do the resulting shared embeddings for the target have the same words as the SRC.EMB embedding file? I'm not sure how we can use the TRG_MAPPED.EMB file, for instance for a Dutch text, if it contains the same words as SRC.EMB (in English). I think I'm missing something here.
