From 9f80658cfe50146f965452255dc3144da3f78f99 Mon Sep 17 00:00:00 2001
From: LemmensJens
Date: Fri, 12 Apr 2024 13:25:44 +0000
Subject: [PATCH] debug custom stop words widget

---
 app.py                |  8 ++++----
 topic_modeling_app.py | 10 +---------
 2 files changed, 5 insertions(+), 13 deletions(-)

diff --git a/app.py b/app.py
index 4634f21..3d1782d 100644
--- a/app.py
+++ b/app.py
@@ -210,10 +210,10 @@ def visible_plots(_):
     - **BERTopic** (Grootendorst, 2022): State-of-the-art neural algorithm that can be used with any pre-trained transformer model that is publicly available on the [Huggingface hub](https://huggingface.co/models). This is useful when you want to use a model that is trained on the same (or similar) data as your corpus. The default model that BERTopic uses is "all-MiniLM-L6-v2" (for English data only). Other models that can be used are, for instance, [BERTje](https://huggingface.co/GroNLP/bert-base-dutch-cased)/[RobBERT](https://huggingface.co/DTAI-KULeuven/robbert-2023-dutch-base) (Dutch), [CamemBERT](https://huggingface.co/almanach/camembert-base) (French), [German BERT](https://huggingface.co/google-bert/bert-base-german-cased), and [multi-lingual BERT](https://huggingface.co/google-bert/bert-base-multilingual-cased) (other languages)
     - **NMF/LDA**: Classical machine learning algorithms, which generally perform worse than Top2Vec and BERTopic, but worth investigating when working with a large corpus that contains relatively long texts.

-    The selection of the embedding model used in Top2Vec or BERTopic should, a.o. depend on the language of your data. If your corpus contains texts written in multiple languages, it is recommended to use a multilingual model. It is also recommended to use a multi-lingual model if the corpus contains texts written in any language other than Dutch, English, French, or German. When no pre-trained mono- or multi-lingual model that was trained on the relevant language (or dialect / historical variant) exists, it is best to either train a new model with Top2Vec/Doc2Vec, or use a model that was pre-trained on a structurally similar language (e.g. use a Dutch model for Afrikaans).
+    The selection of the embedding model used in Top2Vec or BERTopic should depend, among other factors, on the language of your data. A multilingual model is recommended if your corpus contains texts written in multiple languages, or in any language other than Dutch, English, French, or German. When no pre-trained monolingual or multilingual model exists for the relevant language (or dialect / historical variant), it is best to either train a new model with Top2Vec, using Doc2Vec to generate embeddings, or use a model that was pre-trained on a structurally similar language (e.g. use a Dutch model for Afrikaans).

     ### Preprocessing
-    When using the classical machine learning algorithms (NMF/LDA), it is recommended to apply all preprocessing steps provided in the pipeline (tokenization, lemmatization, lowercasing, and removing stopwords and punctuation). For the neural models, it is not required, since they rely on more sophisticated methods, but experimenting with different preprocessing steps could still result in improvements. Note that when selecting lemmatization, it is important to also apply tokenization.
+    When using the classical machine learning algorithms (NMF/LDA), it is recommended to apply all preprocessing steps provided in the pipeline (tokenization, lemmatization, lowercasing, and removing stopwords and punctuation). For the neural models, preprocessing is not required, since they rely on more sophisticated methods, but experimenting with different preprocessing steps can still yield improvements. Note that when selecting lemmatization, it is important to also apply tokenization. Also note that multilingual preprocessing is currently not supported.

     ### Model parameter tuning and evaluation of the results
     Which model and hyperparameters are optimal depends on the data that is used. Therefore, optimization experiments are necessary to find the best configuration. To evaluate the results of the topic modeling algorithm, it is important to investigate both the quantitative results - the diversity and coherence scores - but also the qualitative results by looking at the individual topic predictions, visualizations, and the most important keywords per topic.
@@ -230,10 +230,10 @@ def visible_plots(_):
     gr.Markdown("""
     ### Project
     Toposcope is a topic modeling pipeline that was developed by [CLiPS](https://www.uantwerpen.be/en/research-groups/clips/) ([University of Antwerp](https://www.uantwerpen.be/en/)) during the [CLARIAH-VL](https://clariahvl.hypotheses.org/) project.
-    The code is available here: https://github.com/LemmensJens/CLARIAH-topic
+    The code is available here: https://github.com/clips/toposcope.

     ### Contact
-    If you have questions, please send them to [Jens Lemmens](mailto:jens.lemmens@uantwerpen.be) or [Walter Daelemans](mailto:walter.daelemans@uantwerpen.be)
+    If you have questions, please send them to [Jens Lemmens](mailto:jens.lemmens@uantwerpen.be) or [Walter Daelemans](mailto:walter.daelemans@uantwerpen.be).
     """)

     with gr.Row():
diff --git a/topic_modeling_app.py b/topic_modeling_app.py
index 0fb7048..f490aac 100644
--- a/topic_modeling_app.py
+++ b/topic_modeling_app.py
@@ -86,15 +86,7 @@ def main(
     print(" Lowercase:", lowercase)
     print(" Remove punctuation:", remove_punct)

-    if remove_custom_stopwords:
-        with open(remove_custom_stopwords) as x:
-            lines = x.readlines()
-        custom_stopwords = set([l.strip() for l in lines])
-    else:
-        custom_stopwords = None
-
     tqdm.pandas()
-
     df[column_name] = df[column_name].progress_apply(lambda x: preprocess(
         x,
         nlp,
@@ -102,7 +94,7 @@ def main(
         tokenize,
         lemmatize,
         remove_nltk_stopwords,
-        custom_stopwords,
+        remove_custom_stopwords,
         remove_punct,
         lowercase,
     )
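Note on the topic_modeling_app.py change: main() previously resolved remove_custom_stopwords (a path to a stopword file) into a set before calling preprocess(); after this patch, the raw argument is forwarded and preprocess() is expected to handle it. A minimal sketch of that pattern follows. It assumes the argument is still a path to a newline-separated stopword file, as in the removed code; the simplified signature and the helper load_custom_stopwords() are illustrative assumptions, not Toposcope's actual implementation.

    # Sketch only: assumes `remove_custom_stopwords` is the path to a
    # newline-separated stopword file, mirroring the logic removed from main().
    def load_custom_stopwords(path):
        # Return a set of stopwords, or None when no file was provided.
        if not path:
            return None
        with open(path, encoding="utf-8") as f:
            return {line.strip() for line in f if line.strip()}

    # Simplified stand-in for the pipeline's preprocess(); the real function
    # takes more arguments (nlp, lang, tokenize, lemmatize, ...) and only the
    # custom-stopword step is fleshed out here.
    def preprocess(text, remove_custom_stopwords, lowercase=False):
        custom_stopwords = load_custom_stopwords(remove_custom_stopwords)
        tokens = text.split()  # placeholder for the pipeline's real tokenization
        if custom_stopwords is not None:
            tokens = [t for t in tokens if t.lower() not in custom_stopwords]
        if lowercase:
            tokens = [t.lower() for t in tokens]
        return " ".join(tokens)

Resolving the file inside preprocess() keeps the call site in main() uniform (every widget value is passed through unchanged), at the cost of re-reading the stopword file once per document unless the result is cached.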