# My R Codes For Data Analysis
# Decision Trees
```{r eval=FALSE, include=FALSE, echo=TRUE}
# install.packages("ISLR")
library(ISLR)
data(package = "ISLR")
carseats <- Carseats
carseats
```
```{r eval=FALSE, include=FALSE, echo=TRUE}
# install.packages("tree")
library(tree)
```
```{r eval=FALSE, include=FALSE, echo=TRUE}
names(carseats)
```
```{r eval=FALSE, include=FALSE, echo=TRUE}
hist(carseats$Sales)
```
```{r eval=FALSE, include=FALSE, echo=TRUE}
High <- as.factor(ifelse(carseats$Sales <= 8, "No", "Yes"))
carseats <- data.frame(carseats, High)
carseats
```
```{r eval=FALSE, include=FALSE, echo=TRUE}
tree.carseats <- tree::tree(High~.-Sales, data = carseats)
```
```{r eval=FALSE, include=FALSE, echo=TRUE}
tree.carseats
```
```{r eval=FALSE, include=FALSE, echo=TRUE}
set.seed(101)
train <- sample(1:nrow(carseats), 250)
```
```{r eval=FALSE, include=FALSE, echo=TRUE}
train
```
```{r eval=FALSE, fig.height=6, fig.width=12, include=FALSE}
tree.carseats <- tree(High~.-Sales, carseats, subset=train)
plot(tree.carseats)
text(tree.carseats, pretty=0)
```
```{r eval=FALSE, include=FALSE, echo=TRUE}
tree.pred <- predict(tree.carseats, carseats[-train,], type = "class")
```
```{r eval=FALSE, include=FALSE, echo=TRUE}
tree.pred
```
```{r eval=FALSE, include=FALSE, echo=TRUE}
with(carseats[-train,], table(tree.pred, High))
```
```{r eval=FALSE, include=FALSE, echo=TRUE}
cv.carseats <- cv.tree(tree.carseats, FUN = prune.misclass)
cv.carseats
```
```{r eval=FALSE, include=FALSE, echo=TRUE}
plot(cv.carseats)
```
```
prune.carseats = prune.misclass(tree.carseats, best = 12)
plot(prune.carseats)
text(prune.carseats, pretty=0)
It's a bit shallower than previous trees, and you can actually read the labels. Let's evaluate it on the test dataset again.
tree.pred = predict(prune.carseats, carseats[-train,], type = "class")
with(carseats[-train,], table(tree.pred, High))
(74 + 39) / 150
Seems like the correct classifications dropped a little bit. It has done about the same as your original tree, so pruning did not hurt much with respect to misclassification errors, and gave a simpler tree.
Often, trees don't give very good prediction errors, so let's go ahead and take a look at random forests and boosting, which tend to outperform trees as far as prediction and misclassification are concerned.
Random Forests
For this part, you will use the Boston housing data to explore random forests and boosting. The dataset is located in the MASS package. It gives housing values and other statistics in each of 506 suburbs of Boston based on a 1970 census.
library(MASS)
data(package = "MASS")
boston<-Boston
dim(boston)
names(boston)
Let's also load the randomForest package.
require(randomForest)
To prepare data for random forest, let's set the seed and create a sample training set of 300 observations.
set.seed(101)
train = sample(1:nrow(boston), 300)
In this dataset, there are 506 suburbs of Boston. For each suburb, you have variables such as crime per capita, types of industry, average number of rooms per dwelling, the proportion of older houses, etc. Let's use medv, the median value of owner-occupied homes for each of these suburbs, as the response variable.
Let's fit a random forest and see how well it performs. As mentioned, you use medv, the median housing value (in $1,000s), as the response, together with the training sample set.
rf.boston = randomForest(medv~., data = boston, subset = train)
rf.boston
Printing out the random forest gives its summary: the # of trees (500 were grown), the mean squared residuals (MSR), and the percentage of variance explained. The MSR and % variance explained are based on the out-of-bag estimates, a very clever device in random forests to get honest error estimates.
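# (Added illustration, not in the original tutorial.) The OOB estimate can be
# compared against an ordinary test-set estimate, reusing the rf.boston fit
# and the train index from above:
oob.mse  <- rf.boston$mse[rf.boston$ntree]   # OOB MSE after the last tree
test.mse <- with(boston[-train, ],
                 mean((medv - predict(rf.boston, boston[-train, ]))^2))
c(oob = oob.mse, test = test.mse)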
The only tuning parameter in a random forest is the argument called mtry, which is the number of variables selected at random as candidates each time a split is made. As seen here, mtry is 4 of the 13 explanatory variables (excluding medv) in the Boston Housing data - meaning that each time a tree comes to split a node, 4 variables are selected at random, and the split is confined to one of those 4 variables. That's how randomForest de-correlates the trees.
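# (Added side note, not in the original text.) Setting mtry to all 13
# predictors removes the random variable selection entirely, which turns the
# random forest into plain bagging, a useful reference point:
bag.boston <- randomForest(medv ~ ., data = boston, subset = train, mtry = 13)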
You're going to fit a series of random forests. There are 13 variables, so let's have mtry range from 1 to 13:
In order to record the errors, you set up 2 variables oob.err and test.err.
In a loop of mtry from 1 to 13, you first fit the randomForest with that value of mtry on the train dataset, restricting the number of trees to be 350.
Then you extract the mean-squared-error on the object (the out-of-bag error).
Then you predict on the test dataset (boston[-train,]) using fit (the randomForest fit).
Lastly, you compute the test error: the mean squared error, which equals mean( (medv - pred)^2 ).
oob.err = double(13)
test.err = double(13)
for(mtry in 1:13){
fit = randomForest(medv~., data = boston, subset=train, mtry=mtry, ntree = 350)
oob.err[mtry] = fit$mse[350]
pred = predict(fit, boston[-train,])
test.err[mtry] = with(boston[-train,], mean( (medv-pred)^2 ))
}
Basically you just grew 4550 trees (13 times 350). Now let's make a plot using the matplot command. The out-of-bag error and the test error are bound together with cbind to make a 2-column matrix. There are a few other arguments to matplot, including the plotting character (pch = 23 means filled diamond), colors (red and blue), type = "b" (plot the points and connect them with lines), and the name of the y-axis (Mean Squared Error). You can also put a legend in the top right corner of the plot.
matplot(1:mtry, cbind(oob.err, test.err), pch = 23, col = c("red", "blue"), type = "b", ylab = "Mean Squared Error")
legend("topright", legend = c("OOB", "Test"), pch = 23, col = c("red", "blue"))
Ideally, these 2 curves should line up, but it seems like the test error is a bit lower. However, there's a lot of variability in these test error estimates. Since the out-of-bag error estimate was computed on one dataset and the test error estimate was computed on another dataset, these differences are pretty much well within the standard errors.
Notice that the red curve is smoothly above the blue curve? These error estimates are very correlated, because the randomForest with mtry = 4 is very similar to the one with mtry = 5. That's why each of the curves is quite smooth. What you see is that mtry around 4 seems to be the most optimal choice, at least for the test error. This value of mtry for the out-of-bag error equals 9.
So with very little effort, you have fitted a very powerful prediction model using random forests. How so? The left-hand side shows the performance of a single tree. The out-of-bag mean squared error is 26, and you've dropped down to about 15 (just a bit above half), so you reduced the error by roughly half. Likewise for the test error, you reduced the error from 20 to 12.
Boosting
Compared to random forests, boosting grows smaller, stubbier trees and goes after the bias. You will use the gbm package (Gradient Boosted Models) in R.
require(gbm)
GBM asks for the distribution, which is Gaussian, because you'll be doing squared error loss. You're going to ask GBM for 10,000 trees, which sounds like a lot, but these are going to be shallow trees. Interaction depth is the number of splits, so you want 4 splits in each tree. Shrinkage is 0.01, which controls how much each tree's contribution is shrunk back.
boost.boston = gbm(medv~., data = boston[train,], distribution = "gaussian", n.trees = 10000, shrinkage = 0.01, interaction.depth = 4)
summary(boost.boston)
The summary function gives a variable importance plot. It seems like there are 2 variables that have high relative importance: rm (number of rooms) and lstat (percentage of lower economic status people in the community). Let's plot these 2 variables:
plot(boost.boston, i = "lstat")
plot(boost.boston, i = "rm")
The 1st plot shows that the higher the proportion of lower-status people in the suburb, the lower the housing prices. The 2nd plot shows the reverse relationship with the number of rooms: housing prices increase with the average number of rooms.
It's time to predict a boosted model on the test dataset. Let's look at the test performance as a function of the number of trees:
First, you make a grid of number of trees in steps of 100 from 100 to 10,000.
Then, you run the predict function on the boosted model. It takes n.trees as an argument, and produces a matrix of predictions on the test data.
The dimensions of the matrix are 206 test observations and 100 different predict vectors at the 100 different values of tree.
n.trees = seq(from = 100, to = 10000, by = 100)
predmat = predict(boost.boston, newdata = boston[-train,], n.trees = n.trees)
dim(predmat)
It's time to compute the test error for each of the predict vectors:
predmat is a matrix and medv is a vector, so (predmat - medv) is a matrix of differences. You can apply the mean function to the columns of these squared differences, which computes the column-wise mean squared error for the prediction vectors.
Then you make a plot using similar parameters to that one used for Random Forest. It would show a boosting error plot.
boost.err = with(boston[-train,], apply( (predmat - medv)^2, 2, mean) )
plot(n.trees, boost.err, pch = 23, ylab = "Mean Squared Error", xlab = "# Trees", main = "Boosting Test Error")
abline(h = min(test.err), col = "red")
The boosting error pretty much drops down as the number of trees increases. This is evidence that boosting is reluctant to overfit. Let's also include the best test error from the randomForest into the plot. Boosting actually gets a reasonable amount below the test error for the randomForest.
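# (Added comparison, not in the original text.) The two minima can also be
# compared numerically, using the error vectors computed above:
min(boost.err)   # best boosting test MSE over the n.trees grid
min(test.err)    # best random forest test MSE over the mtry grid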
Conclusion
So that's the end of this R tutorial on building decision tree models: classification trees, random forests, and boosted trees. The latter 2 are powerful methods that you can use anytime as needed. In my experience, boosting usually outperforms random forests, but random forests are easier to tune. In a random forest, essentially the only tuning parameter is mtry; in boosting, more tuning parameters need to be set besides the number of trees, including the shrinkage and the interaction depth.
If you would like to learn more, take a look at the Machine Learning Toolbox course for R.
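# (Added sketch, not from the tutorial.) One hand-rolled way to explore those
# extra boosting parameters: refit with a larger shrinkage and fewer trees,
# then check the test error again.
boost.boston2 <- gbm(medv ~ ., data = boston[train, ], distribution = "gaussian",
                     n.trees = 1000, shrinkage = 0.1, interaction.depth = 4)
pred2 <- predict(boost.boston2, newdata = boston[-train, ], n.trees = 1000)
with(boston[-train, ], mean((medv - pred2)^2))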
```
# Decision tree
https://analytics4all.org/2016/11/23/r-decision-trees-regression/

# Decision tree classifier implementation in R
https://dataaspirant.com/2017/01/30/how-decision-tree-algorithm-works/
https://dataaspirant.com/2017/02/03/decision-tree-classifier-implementation-in-r/

# caret: Classification And REgression Training
```{r eval=FALSE, include=FALSE, echo=TRUE}
library(caret)
library(rpart.plot)
```
```{r eval=FALSE, include=FALSE, echo=TRUE}
data_url <- c("https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data")
download.file(url = data_url, destfile = "data/car.data")
car_df <- read.csv("data/car.data", sep = ',', header = FALSE)
```
```{r eval=FALSE, include=FALSE, echo=TRUE}
set.seed(3033)
intrain <- createDataPartition(y = car_df$V7, p= 0.7, list = FALSE)
training <- car_df[intrain,]
testing <- car_df[-intrain,]
```
```{r eval=FALSE, include=FALSE, echo=TRUE}
#check dimensions of train & test set
dim(training); dim(testing);
```
```{r eval=FALSE, include=FALSE, echo=TRUE}
anyNA(car_df)
```
```{r eval=FALSE, include=FALSE, echo=TRUE}
summary(car_df)
```
```{r eval=FALSE, include=FALSE, echo=TRUE}
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
# The "method" parameter holds the resampling method. It can take many values, such as "boot", "boot632", "cv", "repeatedcv", "LOOCV", "LGOCV", etc. For this tutorial, let's use repeatedcv, i.e. repeated cross-validation.
#
# The "number" parameter holds the number of resampling iterations, and the "repeats" parameter the number of complete sets of folds to compute for the repeated cross-validation. We are setting number = 10 and repeats = 3. trainControl() returns a list, which we pass to our train() call.
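# (Added for comparison, not part of the original tutorial.) A plain 10-fold
# cross-validation control, without repetition, would be a cheaper alternative:
trctrl_cv <- trainControl(method = "cv", number = 10)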
set.seed(3333)
dtree_fit <- train(V7 ~., data = training, method = "rpart",
                   parms = list(split = "information"),
trControl=trctrl,
tuneLength = 10)
# The train() method is passed method = "rpart". There is a separate "rpart" package specifically for decision tree implementation; caret links its train() function to it to make our work simple.
#
# We are passing our target variable V7. The formula "V7 ~ ." means: use all other attributes as predictors and V7 as the target. The "trControl" parameter is passed the result of our trainControl() call.
```
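A hypothetical follow-up (not part of the original notes): with method = "rpart", caret tunes the complexity parameter cp over tuneLength = 10 candidate values, so the selected value and the cross-validated accuracy profile can be read off the fitted object.
```{r eval=FALSE, include=FALSE, echo=TRUE}
dtree_fit$bestTune   # cp value chosen by repeated cross-validation
plot(dtree_fit)      # accuracy across the candidate cp values
```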
```{r eval=FALSE, include=FALSE, echo=TRUE}
?rpart
```
```{r eval=FALSE, include=FALSE, echo=TRUE}
dtree_fit
```
```{r eval=FALSE, include=FALSE, echo=TRUE}
prp(dtree_fit$finalModel, box.palette = "Reds", tweak = 1.2)
```
```{r eval=FALSE, include=FALSE, echo=TRUE}
testing[1,]
```
```{r eval=FALSE, include=FALSE, echo=TRUE}
predict(dtree_fit, newdata = testing[1,])
```
```{r eval=FALSE, include=FALSE, echo=TRUE}
test_pred <- predict(dtree_fit, newdata = testing)
```
```{r eval=FALSE, include=FALSE, echo=TRUE}
confusionMatrix(test_pred, testing$V7)  # check accuracy
```
```{r eval=FALSE, include=FALSE, echo=TRUE}
set.seed(3333)
dtree_fit_gini <- train(V7 ~., data = training, method = "rpart",
                        parms = list(split = "gini"),
trControl=trctrl,
tuneLength = 10)
dtree_fit_gini
```
```{r eval=FALSE, include=FALSE, echo=TRUE}
prp(dtree_fit_gini$finalModel, box.palette = "Blues", tweak = 1.2)
```
```{r eval=FALSE, include=FALSE, echo=TRUE}
test_pred_gini <- predict(dtree_fit_gini, newdata = testing)
confusionMatrix(test_pred_gini, testing$V7)  # check accuracy
```