
Topic modelling #188

Open

DonggeLiu wants to merge 106 commits into master

Conversation

DonggeLiu

No description provided.

DonggeLiu and others added 30 commits June 29, 2017 10:43
tokenize articles
…r yet)

2. A path helper to assist imports (see the sketch after this list)
3. Modified token_pool to make it compatible with the LDA model
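
A path helper of this kind typically just puts the repository root on sys.path so that sibling packages import cleanly. A minimal sketch, assuming the helper walks up from its own location (the function name and directory depth below are not from the PR):

```python
# Minimal sketch of a path helper for imports; the function name and the way
# the repository root is located are assumptions, not the PR's actual code.
import os
import sys


def add_repo_root_to_path() -> None:
    """Prepend the assumed repository root to sys.path so sibling modules import."""
    repo_root = os.path.abspath(os.path.join(os.path.dirname(__file__), '..', '..'))
    if repo_root not in sys.path:
        sys.path.insert(0, repo_root)
```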
1. Made every variable and method private where possible
2. Reformatted code with PyCharm shortcuts
3. Added tests for TokenPool (works well) and ModelGensim (does not work yet, due to a 'no module named XXX' problem when model_gensim calls its abstract parent)
4. Decoupled token_pool and model_*
5. Used if __name__ == '__main__' to give a simple demonstration of how to use each method
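
The `__main__` demonstration in item 5 presumably looks something like the sketch below; TokenPool, ModelGensim, add_stories() and summarise() come from the commit messages, while the constructor arguments, the output_tokens() accessor and the return shape are assumptions:

```python
# Rough sketch of the demonstration described in item 5; class and method names
# come from the commit messages, but the exact signatures are assumptions.
if __name__ == '__main__':
    from token_pool import TokenPool
    from model_gensim import ModelGensim

    pool = TokenPool()                   # assumed: fetches and tokenizes stories
    story_tokens = pool.output_tokens()  # assumed accessor name
    model = ModelGensim()
    model.add_stories(story_tokens)      # feed tokenized stories to the model
    print(model.summarise())             # print the detected topics per story
```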

Model_*
1. Renamed model_lda.py and model_lda2.py to model_gensim.py (which uses the Gensim package) and model_lda.py (which uses the LDA package)
2. Added an abstract parent class, TopicModel.py (see the sketch after this list)
3. Moved some code from summarise() to add_stories() (a. better code structure; b. improved performance)
4. Changed some constants into function arguments (e.g. total_topic_num, iteration_num, etc.)
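
A plausible shape for the abstract parent and the promoted arguments, as a sketch: only TopicModel, add_stories(), summarise(), total_topic_num and iteration_num appear in the commit messages; the signatures, defaults and docstrings below are assumptions.

```python
# Sketch of the abstract parent. Only the names TopicModel, add_stories(),
# summarise(), total_topic_num and iteration_num come from the commit messages;
# signatures, defaults and docstrings are assumptions.
from abc import ABC, abstractmethod


class TopicModel(ABC):
    """Common interface shared by the Gensim-based and LDA-based models."""

    @abstractmethod
    def add_stories(self, story_tokens: dict) -> None:
        """Pre-process and store tokenized stories (work moved here from summarise())."""

    @abstractmethod
    def summarise(self, total_topic_num: int = 1, iteration_num: int = 1000) -> dict:
        """Return the topics detected for each story added so far."""
```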

TokenPool
1. Added mc_root_path() when locating the stopwords file
2. Modified the query in the token pool (see the sketch after this list):
	1. Added "INNER JOIN stories WHERE language='en'" to guarantee that all stories are in English
	2. Added "LIMIT" and a corresponding "SELECT DISTINCT ... ORDER BY ..." to fetch only the required number of stories (thus improving performance)
	3. Added "OFFSET"
3. Restructured token_pool.py so that the stories are traversed only once (thus improving performance)
4. Decoupled the DB from token_pool.py
5. Replaced regex tokenization with the NLTK tokenizer
6. Added nltk.stem.WordNetLemmatizer to lemmatize tokens (which gives a better result than stemming)
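
Roughly what the query and tokenization changes might look like. The language filter, LIMIT/OFFSET, the NLTK tokenizer and WordNetLemmatizer come from the list above; the table and column names are assumptions, and the stop-word list here uses NLTK's for self-containedness, whereas the PR loads a stopwords file via mc_root_path().

```python
# Illustrative sketch only. The language filter, LIMIT/OFFSET, NLTK tokenization
# and WordNetLemmatizer come from the commit messages; the table/column names are
# assumptions, and NLTK's stop-word list stands in for the PR's stopwords file.
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Fetch only English stories, and only as many as requested.
STORY_QUERY = """
    SELECT DISTINCT story_sentences.stories_id, story_sentences.sentence
    FROM story_sentences
        INNER JOIN stories ON stories.stories_id = story_sentences.stories_id
    WHERE stories.language = 'en'
    ORDER BY story_sentences.stories_id
    LIMIT %(limit)s OFFSET %(offset)s
"""

_LEMMATIZER = WordNetLemmatizer()
_STOP_WORDS = set(stopwords.words('english'))


def tokenize_sentence(sentence: str) -> list:
    """Tokenize with NLTK and lemmatize each token, dropping stop words."""
    tokens = word_tokenize(sentence.lower())
    return [_LEMMATIZER.lemmatize(token) for token in tokens
            if token.isalpha() and token not in _STOP_WORDS]
```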
The result of this algorithm is similar to, but slightly different from, the LDA model:
it allows multiple topics for each story (see the sketch below)
2. Renamed a few methods/variables due to the change in functionality
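
This is the behaviour being described: Gensim's LdaModel returns a per-document topic distribution, so one story can be associated with more than one topic. The corpus below is a toy example, not data from the PR:

```python
# Toy illustration of why the Gensim-based model can assign multiple topics
# to a single story; the documents and parameters here are made up.
from gensim import corpora, models

story_tokens = [
    ['election', 'vote', 'senate', 'budget'],
    ['match', 'goal', 'league', 'budget'],
]

dictionary = corpora.Dictionary(story_tokens)
corpus = [dictionary.doc2bow(tokens) for tokens in story_tokens]
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)

# get_document_topics() returns (topic_id, probability) pairs, so a story can
# belong to several topics at once rather than being assigned exactly one.
for bow in corpus:
    print(lda.get_document_topics(bow, minimum_probability=0.05))
```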
…ter efficiency and performance

I will combine these two later
This allows more flexibility in Travis (i.e. we can use larger samples if we can run tests for longer in Travis; see the sketch below)
2. Improved performance based on empirical results
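
Presumably the sample size is now a constant or argument that CI can override; a minimal sketch under that assumption (the names below are hypothetical, not from the PR):

```python
# Hypothetical sketch: exposing the test sample size so a CI job can request a
# larger sample when longer runs are acceptable. Names are not from the PR.
import os

DEFAULT_SAMPLE_SIZE = 100


def sample_size_for_tests() -> int:
    # TOPIC_MODEL_SAMPLE_SIZE is an assumed environment variable.
    return int(os.environ.get('TOPIC_MODEL_SAMPLE_SIZE', str(DEFAULT_SAMPLE_SIZE)))
```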