Benchmarking for the different algorithms #26

Open
assem-ch opened this issue Mar 5, 2018 · 4 comments

Comments

@assem-ch
Owner

assem-ch commented Mar 5, 2018

No description provided.

@sneetsher

sneetsher commented Jun 1, 2018

@assem-ch, this will be a lot of work for one person. I see many separate reports for different words, and it is not practical to follow each one manually. A stemmer will never be 100% perfect the way manual analysis by a human can be.

  • Better to switch to coverage-based tests that use manually processed data, at least word roots/derivatives without attached pronouns and conjunctions, which may already be available on the web under a free-use license. (For example, harfbuzz coverage testing was done on Wikipedia data rendered in multiple browsers: https://github.com/harfbuzz/harfbuzz-testing-wikipedia.) I could possibly implement them if you can find some data sets on the web for me to use or scrape.
  • We merge all reports about specific words into one or several reports (one report per algorithm and version). I can help with this; just write short debugging/reporting instructions in the README, such as how to report the release number and how a user can tell which algorithm is in use.
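
To make the coverage idea concrete, here is a minimal sketch of what such a test could look like. Everything in it is an assumption for illustration: `coverage_report`, the two toy stemmers, and the three gold pairs are hypothetical and are not taken from this project or from any existing data set. A real run would plug in the actual stemmer algorithms and a manually verified word list.

```python
from typing import Callable, Dict, List, Tuple

# A gold entry is (inflected word, expected stem), verified by hand.
GoldPair = Tuple[str, str]


def coverage_report(
    stemmers: Dict[str, Callable[[str], str]],
    gold_pairs: List[GoldPair],
) -> Dict[str, Tuple[float, List[GoldPair]]]:
    """Score each stemmer against the gold data.

    Returns, per stemmer name, the fraction of words stemmed as expected
    and the list of gold pairs it got wrong.
    """
    report = {}
    for name, stem in stemmers.items():
        failures = [(w, expected) for w, expected in gold_pairs if stem(w) != expected]
        total = len(gold_pairs)
        accuracy = (total - len(failures)) / total if total else 0.0
        report[name] = (accuracy, failures)
    return report


# Hypothetical stand-ins for the real algorithms, for illustration only.
def identity(word: str) -> str:
    return word


def strip_article(word: str) -> str:
    # Removes only a leading definite article; nothing else.
    return word[2:] if word.startswith("ال") else word


# Toy gold data, not from any real corpus.
GOLD = [
    ("الكتاب", "كتاب"),
    ("كتاب", "كتاب"),
    ("والكتاب", "كتاب"),
]

report = coverage_report({"identity": identity, "strip_article": strip_article}, GOLD)
for name, (accuracy, failures) in report.items():
    print(f"{name}: {accuracy:.0%} correct, {len(failures)} failure(s)")
```

Keeping one score and one failure list per algorithm also fits the second point: a single merged report per algorithm and version, instead of one issue per word.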

@sneetsher

Btw, not all release packages mention their version.

@sneetsher

sneetsher commented Jun 1, 2018

@assem-ch If you think the Arabic stemmer is valuable enough for Alfanous and for many other Arabic projects, I could put considerable time into it.

I will see whether to start a separate project for a stop-word list and a derivatives list, built through crowd-sourced verification, to get high-quality test data for the stemmer. I see only a few Arabic stop-word lists, and they are basically the efforts of a few individuals.

It all depends on how we reach the interested and active Arab community.

@assem-ch
Owner Author

assem-ch commented Jun 5, 2018

@sneetsher there is already a project for testing data: https://github.com/ibnmalik/golden-corpus-arabic. It can be exposed alongside the stemmer to collect new suggestions from users.

This is my PhD project and I should really focus on it... I will work on a demo/review web app for it to welcome feedback and improve its visibility.

For Alfanous it's not that valuable, but it fills a gap: stemming words that don't appear in the Quran in that exact form but do appear in other forms.
