Classification tree to the housing data using the R package rpart

The aim of this project is to fit a classification tree to the housing data using the R package rpart. So, I begin by loading the data, and indicating which variables are categorical (note that some of them are ordinal variables, and we should treat them as such!)

library(rpart)
info=c("Y","sex","marital_status","age","education","occupation","annual_income","length",
"dual_incomes","size_household","nb_minors","householder_status","ethnic","language")
X=read.table("Housetype_Data.csv",sep=",",header=F)
Y=X[,1]
X=data.frame(X)
colnames(X)=info
X$Y<-as.factor(X$Y)
X$sex<-as.factor(X$sex)
X$marital_status<-as.factor(X$marital_status)
X$occupation<-as.factor(X$occupation)
X$dual_incomes<-as.factor(X$dual_incomes)
X$ethnic<-as.factor(X$ethnic)
X$householder_status<-as.factor(X$householder_status)
X$language<-as.factor(X$language)
#### The goal is too infer Y from X by fitting an optimal classification tree

Let's begin by analyzing the data: nb of missing values, spread, etc (just to get the feel of the data)

##### We begin by analyzing the data: nb of missing values, spread, etc
K= 5 ## number of possible classes for the housing
summary(X)

# grow tree
fit <- rpart(Y~ .,method="class", data=X,cp=0) ### Y vs X (the rest of the variables)
### This method does exactly what we want: it gives back a classification tree according to the formula given
### The complexity paramter
plotcp(fit) # visualize cross-validation results

### Prune the tree: select the CP that is 1MSE above the cp-value that minimizes the misclassification error
# get index of CP with lowest xerror
opt_cp <- which.min(fit$cptable[,'xerror'])
candidate_cp<- which(fit$cptable[,'xerror']>(fit$cptable[opt_cp,'xerror']+fit$cptable[opt_cp,'xstd']))
cp_mse_ind<- min(which(fit$cptable[,'xerror']<(fit$cptable[opt_cp,'xerror']+fit$cptable[opt_cp,'xstd'])))
cp_mse=fit$cptable[cp_mse_ind,'CP']
#get its value
#prune tree
pruned_model <- prune(fit,cp_mse)
### as input
printcp(pruned_model) # display the results in terms of complexity paramter

#plot tree
library(rpart.plot)

Warning: package rpart.plot was built under R version 3.3.2

rpart.plot(pruned_model, uniform=TRUE,main="Classification Tree for Household_Data")

We note that:

• class 2 (Condominium) and 4 (mobile home) are not predicted by our classification tree. This is due to the fact that these are relatively rare classes, which account for only respectively 7.5 and 1.8 % of the observations.

• The most important variable in our case if the householder status (own, rent, live with parent)

• The misclassification cross-validated error rate is roughly 0.409 x 0.636 = 0.2601

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
README.md		README.md
housetype_data.csv		housetype_data.csv
rpart.r		rpart.r

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Classification tree to the housing data using the R package rpart

About

Releases

Packages

Languages

rajeshidumalla/Classification-tree-to-the-housing-data-using-the-R-package-rpart

Folders and files

Latest commit

History

Repository files navigation

Classification tree to the housing data using the R package rpart

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages