To avoid conflicts with other projects or system-wide Python packages, it's recommended to set up a virtual environment for this project. Here's how to do it:
- Python 3.x (Ensure Python 3 is installed on your system.)
- Navigate to Your Project Directory: Open a terminal or command prompt and navigate to the root directory of this project.
- Create a Virtual Environment: Run the following command to create a virtual environment named env (you can choose any name you prefer):
  python -m venv env
  This command creates a new directory env within your project where all dependencies will be installed.
- Activate the Virtual Environment:
  - On Windows, run:
    .\env\Scripts\activate
  - On macOS or Linux, run:
    source env/bin/activate
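
To confirm that activation worked, a small check can be run with the Python interpreter from the activated shell. This is a minimal sketch that is not part of the project; it relies only on the standard library, where sys.prefix differs from sys.base_prefix inside a venv.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter:", sys.executable)
```

If the first line prints False, the environment is not active in the current shell.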
To run the scripts, you need to install the dependencies. Follow the steps below to set up your environment.
- Python 3.x (Make sure Python 3 is installed on your system.)
- Install Requirements:
pip install -r requirements.txt
- Install PyTorch: The GPU (CUDA) version of PyTorch is recommended. Visit the PyTorch Get Started page, select your preferences, and run the provided installation command.
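
After installing, it can be useful to verify whether PyTorch sees a GPU. The following is a hedged sketch rather than code from this project: pick_device is a hypothetical helper name, and it falls back to the CPU when PyTorch is missing or no CUDA device is visible.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" if PyTorch is installed and sees a GPU, else "cpu"."""
    # find_spec avoids an ImportError when PyTorch is not installed yet.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print("device:", pick_device())
```

If this prints cpu on a machine with an NVIDIA GPU, the installed PyTorch build and the CUDA version are probably not compatible.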
This section provides instructions for using the machine translation scripts included in this project: translate_google.py and translate_mideind.py. These scripts are used for translating text data into Icelandic for sentiment analysis.
translate_google.py is a Python script for translating text data using Google's translation service. It translates reviews from the "IMDB-Dataset.csv" file located in the Datasets directory and saves the translated text in a new file. The script uses multithreading to enhance performance and includes error handling for translation failures.
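
The combination of multithreading and per-item error handling described above can be sketched roughly as follows. This is an illustration under assumptions, not the script's actual code: translate_all and fake_translate are hypothetical names, and fake_translate stands in for the real call to the translation service.

```python
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed

logger = logging.getLogger(__name__)


def translate_all(texts, translate, max_workers=8):
    """Translate texts concurrently; return (translated, failed) dicts keyed by row index."""
    translated, failed = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the index of the text it is translating.
        futures = {pool.submit(translate, text): i for i, text in enumerate(texts)}
        for future in as_completed(futures):
            i = futures[future]
            try:
                translated[i] = future.result()
            except Exception as exc:
                # One failed row must not stop the rest of the batch.
                logger.warning("translation failed for row %d: %s", i, exc)
                failed[i] = texts[i]
    return translated, failed


def fake_translate(text):
    """Hypothetical stand-in for the real translation call."""
    if not text:
        raise ValueError("empty text")
    return text.upper()


if __name__ == "__main__":
    done, failed = translate_all(["good film", "", "bad film"], fake_translate)
    print("translated:", done)
    print("failed:", failed)
```

Threads suit this kind of work because each call mostly waits on the network, so the failed rows can be collected and retried later without blocking the others.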
- Python 3.x
- Pandas library
- googletrans version 3.1.0a0
- Other dependencies: concurrent.futures, threading, logging
- Ensure the "IMDB-Dataset.csv" file is located in the Datasets directory.
- Run the script:
  python src/translate_google.py
- The script will translate the data and output two files in the Datasets directory:
  - IMDB-Dataset-GoogleTranslate.csv: Contains translated reviews and sentiments.
  - failed-IMDB-Dataset-GoogleTranslate.csv: Logs failed translation attempts.
To use a different dataset:
- Place your CSV dataset in the Datasets directory.
- The dataset should have 'review' and 'sentiment' columns.
- Modify the script if your dataset columns have different names.
- Modify the script's dataset variable to match your dataset's filename.
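
Before pointing a script at a new dataset, the expected columns can be checked up front. The sketch below is an assumption-laden helper, not part of the project: missing_columns is a hypothetical name, and it uses the standard csv module instead of pandas so that it runs without extra dependencies.

```python
import csv


def missing_columns(csv_path, required=("review", "sentiment")):
    """Return the required column names that are absent from the CSV header."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        # Only the first row (the header) is read; an empty file yields [].
        header = next(csv.reader(handle), [])
    return [name for name in required if name not in header]
```

An empty list means the header contains both columns; anything else names the columns that need to be renamed in the dataset or in the script.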
translate_mideind.py is a Python script for translating text data using the "mideind/nmt-doc-en-is-2022-10" model. It translates reviews from the "IMDB-Dataset.csv" file in the Datasets directory and saves the translated text in a new file.
- Python 3.x
- PyTorch
- transformers library
- Pandas library
- Other dependencies: re, logging
- If you plan to use GPU acceleration with PyTorch, make sure your CUDA version is compatible with the installed PyTorch version.
- Run the script:
  python src/translate_mideind.py
- The script will process the data and output two files in the Datasets directory:
  - IMDB-Dataset-MideindTranslate.csv: Contains translated reviews and sentiments.
  - failed-IMDB-Dataset-MideindTranslate.csv: Logs failed translation attempts.
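
The split into a results file and a failures file could be produced along these lines. This is only a sketch: write_results is a hypothetical helper, and the column layout of the failures file is an assumption, since the actual format written by the script is not documented here.

```python
import csv


def write_results(rows, ok_path, failed_path):
    """Write translated rows and failed rows to separate CSV files.

    Each row is a (review, sentiment, translation) tuple, where
    translation is None when the translation attempt failed.
    Returns (number_translated, number_failed).
    """
    n_ok = n_failed = 0
    with open(ok_path, "w", newline="", encoding="utf-8") as ok_file, \
            open(failed_path, "w", newline="", encoding="utf-8") as failed_file:
        ok_writer = csv.writer(ok_file)
        failed_writer = csv.writer(failed_file)
        # Assumed layout: both files use the same two columns.
        ok_writer.writerow(["review", "sentiment"])
        failed_writer.writerow(["review", "sentiment"])
        for review, sentiment, translation in rows:
            if translation is None:
                # Keep the original text so the row can be retried later.
                failed_writer.writerow([review, sentiment])
                n_failed += 1
            else:
                ok_writer.writerow([translation, sentiment])
                n_ok += 1
    return n_ok, n_failed
```

Keeping the original review text in the failures file makes it possible to rerun only the rows that failed.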
To use a different dataset:
- Place your CSV dataset in the Datasets directory.
- The dataset should have 'review' and 'sentiment' columns.
- Modify the script if your dataset columns have different names.
- Modify the script's dataset variable to match your dataset's filename.
This section provides instructions for using the process.py script, which performs text normalization and preprocessing for Icelandic text using IceNLP.
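
As a rough idea of what text normalization can involve before tagging, the sketch below lowercases a review, drops HTML tags, strips ASCII punctuation, and collapses whitespace. It is an assumption about typical preprocessing, not the steps process.py actually performs, and it does not call IceNLP; normalize is a hypothetical name.

```python
import re
import string

_TAG = re.compile(r"<[^>]+>")
# string.punctuation covers ASCII punctuation only; Icelandic letters
# such as þ, ð and ó are left untouched.
_PUNCT = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, remove HTML tags and ASCII punctuation, collapse whitespace."""
    text = _TAG.sub(" ", text.lower())
    text = text.translate(_PUNCT)
    return " ".join(text.split())


if __name__ == "__main__":
    print(normalize("Þetta er <br /> GÓÐ mynd!!"))
```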
- Python 3.x
- Pandas library
- IceNLP tool (https://github.com/hrafnl/icenlp)
- Other dependencies: multiprocessing, os, string, sys, time, tkinter, re, joblib, nefnir
- Download IceNLP from the IceNLP GitHub repository and extract it.
- Run the script:
python src/process.py
- When prompted, select the icetagger.bat file located in the extracted IceNLP directory (IceNLP-1.5.0\IceNLP\bat\icetagger).
- Ensure the dataset file (IMDB-Dataset-MideindTranslate.csv) is located in the Datasets directory relative to the script.
- The script will process the dataset and output the processed data to Datasets/IMDB-Dataset-MideindTranslate-processed-nefnir.csv.
To use a different dataset:
- Place your CSV dataset in the Datasets directory.
- The dataset should have 'review' and 'sentiment' columns.
- Modify the dataset_path variable in the script to match your dataset's filename.
This section provides instructions for using the process_eng.py script, which performs text normalization and preprocessing for English text.
- Python 3.x
- Pandas library
- NLTK library
- Other dependencies: os, time, re, joblib
- Download necessary NLTK data:
python -m nltk.downloader punkt stopwords wordnet
- Ensure the dataset file (IMDB-Dataset.csv) is located in the Datasets directory.
- Run the script:
  python src/process_eng.py
- The script will process the dataset and output the processed data to Datasets/IMDB-Dataset-Processed.csv.
To use a different dataset:
- Place your dataset in the Datasets directory.
- The dataset should be in CSV format with a 'review' column.
- Modify the dataset_path variable in the script to match your dataset's filename.
This section provides instructions for using the BaselineClassifiersBinary.ipynb notebook, which trains SVC, Logistic Regression and Naive Bayes classifiers on the English, Icelandic Google and Icelandic Miðeind datasets. It also generates classification reports for each model.
- Python 3.x
- PyTorch
- Pandas library
- Scikit-learn library
- Other dependencies: os, time, numpy
Open BaselineClassifiersBinary.ipynb and run the cells. Before running them, change the ICELANDIC_GOOGLE_CSV, ICELANDIC_MIDEIND_CSV and ENGLISH_CSV variables to point to the correct datasets. The cell will train the models, print the classification report for each one, and show a diagram. The next cell prints the most important features; running it is optional.
This section provides instructions for using the train.py script, which trains a transformer model for sentiment analysis.
- Python 3.x
- Transformers library
- PyTorch
- Pandas library
- Scikit-learn library
- Other dependencies: os, time, numpy
- If you plan to use GPU acceleration with PyTorch, make sure your CUDA version is compatible with the installed PyTorch version.
- Place the dataset file (default: "IMDB-Dataset-GoogleTranslate.csv") in the Datasets directory relative to the script.
- Modify the script if you want to use a different pre-trained model or dataset.
- Run the script:
  python src/train.py
- The script will train the model using the specified dataset and save the trained model and tokenizer in the Models directory.
To use a different dataset:
- Place your dataset in the Datasets directory.
- The dataset should be in CSV format with 'review' and 'sentiment' columns.
- Modify the dataset_path variable in the script to match your dataset's filename.
This section provides instructions for using the generate_report.ipynb notebook, which generates a classification report for a trained model. A pre-trained transformer model is available on Hugging Face: https://huggingface.co/Birkir/electra-base-igc-is-sentiment-analysis-google-translate
This is useful mostly for the transformer models, as the baseline classifiers generate their own reports via the same libraries.
The call_model function runs the model and generates a classification report for it. It expects the path to the model folder, the device to use, the pandas columns to use as X and y, and whether to return the accuracy or the classification report.
- Import generate_classification_report.py (and pandas, which the next step uses):
  import pandas as pd
  import generate_classification_report as gcr
- Load the CSV file with the data to be tested:
  df = pd.read_csv('IMDB-Dataset-GoogleTranslate.csv')
- Call the call_model function, which takes the parameters:
  - X_all: All review columns
  - y_all: All sentiment columns
  - model: The model to be used. This is either a local path (something like ./electra-base-google-batch8-remove-noise-model/) or the Hugging Face path (Birkir/electra-base-igc-is-sentiment-analysis-google-translate)
  - device: The device to be used (CUDA, cpu)
  - accuracy: Whether to return accuracy or return a classification report
An example of how to generate a report can be seen in generate_report.ipynb. See also the eval_files() function in generate_classification_report.py, which loads multiple models.
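
For readers unfamiliar with what a classification report contains, the sketch below computes accuracy plus per-class precision, recall and F1 from true and predicted labels. It is a standalone illustration in plain Python, not the implementation behind call_model; per_class_metrics is a hypothetical name.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return (accuracy, {label: (precision, recall, f1)})."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label was wrong
            fn[truth] += 1  # true label was missed
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        report[label] = (precision, recall, f1)
    accuracy = sum(tp.values()) / len(y_true) if y_true else 0.0
    return accuracy, report


if __name__ == "__main__":
    acc, rep = per_class_metrics(
        ["positive", "positive", "negative", "negative"],
        ["positive", "negative", "negative", "negative"],
    )
    print("accuracy:", acc)
    for label, (p, r, f) in rep.items():
        print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Accuracy alone hides which class the model gets wrong, which is why the per-class numbers are usually the more informative output.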
https://github.com/olafurjohannsson/sentiment-analysis/tree/main
https://huggingface.co/Birkir/electra-base-igc-is-sentiment-analysis-google-translate
MIT
Ólafur Aron Jóhannsson
Eysteinn Örn
Birkir Arndal