Decision Tree

This repository contains a Decision Tree classifier for documents. More information on decision trees can be found here.

The classifier is built from the ground up. Many existing packages, such as sklearn, provide the same Decision Tree functionality; this repository is my own implementation of the model. The model takes as input document vectors of the form:

[feature_1_freq, feature_2_freq, ..., feature_n_freq]

Each element in the vector corresponds to a feature (token) in the vocabulary, and the value of each element is the frequency of that feature. A vectorizer for any given document will be provided in future updates.
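Until a vectorizer ships with the repository, the mapping from a document to a frequency vector can be illustrated with a minimal sketch. The function name `vectorize`, the whitespace tokenization, and the toy vocabulary are assumptions made for illustration and are not part of this repository.

```python
from collections import Counter

def vectorize(tokens, vocabulary):
    """Return one frequency per vocabulary entry, in vocabulary order."""
    counts = Counter(tokens)
    # Counter yields 0 for tokens that never occur in the document.
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary and document, for illustration only.
vocabulary = ["gun", "law", "peace"]
document = "the gun law is a law".split()

vector = vectorize(document, vocabulary)
print(vector)
```

Tokens outside the vocabulary ("the", "is", "a") are ignored, so every vector has the same length as the vocabulary.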

This implementation uses information gain as the feature selection criterion.
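As a rough illustration of the criterion, information gain is the entropy of the labels minus the weighted entropy of the label subsets produced by a split. The sketch below assumes a binary split on whether a feature is present (frequency greater than zero); the repository's own `compute_info_gain` in util.py may split differently, and the names `entropy` and `info_gain` are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def info_gain(vectors, labels, feature_index):
    """Gain from splitting on presence (freq > 0) of one feature.

    The presence/absence split is an assumption of this sketch.
    """
    total = len(labels)
    if total == 0:
        return 0.0
    present = [y for x, y in zip(vectors, labels) if x[feature_index] > 0]
    absent = [y for x, y in zip(vectors, labels) if x[feature_index] == 0]
    remainder = (len(present) / total) * entropy(present) \
              + (len(absent) / total) * entropy(absent)
    return entropy(labels) - remainder

# Toy data: feature 0 separates the classes perfectly, feature 2 not at all.
vectors = [[1, 0, 1], [2, 0, 0], [0, 1, 1], [0, 3, 0]]
labels = ["guns", "guns", "misc", "misc"]

print(info_gain(vectors, labels, 0))
print(info_gain(vectors, labels, 2))
```

A tree builder would typically evaluate every candidate feature this way at each node and split on the one with the highest gain.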

Usage

The main script that runs the Decision Tree classifier is run.py:

from model import DecisionTree
import build_dt

# define paths to training and testing data
TRAIN_PATH = "train.vectors.txt"
TEST_PATH = "test.vectors.txt"

Set TRAIN_PATH and TEST_PATH to the paths of the training and test sets before running run.py.

Setting Up Model

Two parameters control how the tree is learned. MAX_DEPTH limits the number of levels in the decision tree, while MIN_GAIN sets the minimum information gain that observing a feature must provide.

In build_dt.py:

from dtnode import DTNode
from util import compute_info_gain
from model import DecisionTree

MAX_DEPTH = 50
MIN_GAIN = 0
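One plausible way these two settings act as stopping rules during tree construction is sketched below. The function `should_stop`, its arguments, and the choice to stop when the best gain does not exceed MIN_GAIN are assumptions for illustration; the actual logic in build_dt.py may differ.

```python
MAX_DEPTH = 50
MIN_GAIN = 0

def should_stop(depth, best_gain, labels):
    """Decide whether a node should become a leaf (hypothetical helper)."""
    # A node whose examples all share one label cannot be improved.
    if len(set(labels)) <= 1:
        return True
    # Depth limit reached.
    if depth >= MAX_DEPTH:
        return True
    # Assumption: a split must gain strictly more than MIN_GAIN.
    return best_gain <= MIN_GAIN

print(should_stop(3, 0.5, ["guns", "misc"]))   # mixed labels, useful split
print(should_stop(50, 0.5, ["guns", "misc"]))  # depth limit reached
```

Lowering MAX_DEPTH or raising MIN_GAIN produces a smaller tree, which usually trades training accuracy for less overfitting.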

Output

The model will output a confusion matrix and the train and test accuracies.

Confusion matrix for the test data:
row is the truth, column is the system output

                       talk.politics.guns  talk.politics.mideast  talk.politics.misc
talk.politics.guns                    900                      0                   0
talk.politics.mideast                   8                    891                   1
talk.politics.misc                     39                     43                 818

 Test accuracy=0.9662962962962963
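The reported accuracy can be checked directly from the confusion matrix: it is the sum of the diagonal (correct predictions) divided by the total number of test documents. The snippet below is a standalone check using the numbers above; the variable names are illustrative and do not come from the repository.

```python
labels = ["talk.politics.guns", "talk.politics.mideast", "talk.politics.misc"]

# Rows are the true classes, columns are the system output.
confusion = [
    [900, 0, 0],
    [8, 891, 1],
    [39, 43, 818],
]

correct = sum(confusion[i][i] for i in range(len(labels)))
total = sum(sum(row) for row in confusion)
accuracy = correct / total

# Per-class recall: correct predictions divided by the row total.
recall = {label: confusion[i][i] / sum(confusion[i])
          for i, label in enumerate(labels)}

print(f"accuracy = {correct}/{total} = {accuracy:.4f}")
for label, value in recall.items():
    print(f"{label}: recall = {value:.4f}")
```

Here 2609 of 2700 documents are classified correctly, and most of the errors come from talk.politics.misc documents being assigned to the other two classes.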
