what's wrong with this small decision tree using C5.0?

I'm trying to make a simple decision tree using C5.0 in R.

The data has 3 columns (including the target) and 14 rows. This is my 'jogging' data; the target variable is 'CLASSIFICATION':

WEATHER   JOGGED_YESTERDAY   CLASSIFICATION
C          N                   
W          Y                  -
Y          Y                  -
C          Y                  -
Y          N                  -
W          Y                  -
C          N                  -
W          N                   
C          Y                  -
W          Y                   
W          N                   
C          N                   
Y          N                  -
W          Y                  -

or, as a dput() result:

structure(list(WEATHER = c("C", "W", "Y", "C", "Y", "W", "C", 
"W", "C", "W", "W", "C", "Y", "W"), JOGGED_YESTERDAY = c("N", 
"Y", "Y", "Y", "N", "Y", "N", "N", "Y", "Y", "N", "N", "N", "Y"
), CLASSIFICATION = c(" ", "-", "-", "-", "-", "-", "-", " ", 
"-", " ", " ", " ", "-", "-")), class = "data.frame", row.names = c(NA, 
-14L))

jogging <- read.csv("Jogging.csv")

jogging                                                       # training data

library(C50)
jogging$CLASSIFICATION <- as.factor(jogging$CLASSIFICATION)   # target must be a factor
jogging_model <- C5.0(jogging[-3], jogging$CLASSIFICATION)    # predictors first, then target

jogging_model
summary(jogging_model)
plot(jogging_model)

But it does not build any decision tree. I thought it should have produced 2 decision nodes (one for each of the 2 predictor columns), so I want to know what's wrong :(

CodePudding user response:

For this answer I will use a different tree-building package, partykit, simply because I am more used to it. Let's do the following:

jogging <- read.table(header = TRUE, text = "WEATHER   JOGGED_YESTERDAY   CLASSIFICATION
C          N                   
W          Y                  -
Y          Y                  -
C          Y                  -
Y          N                  -
W          Y                  -
C          N                  -
W          N                   
C          Y                  -
W          Y                   
W          N                   
C          N                   
Y          N                  -
W          Y                  -",
                      stringsAsFactors = TRUE)

library(partykit)
ctree(CLASSIFICATION ~ WEATHER + JOGGED_YESTERDAY, data = jogging, 
      minsplit = 1, minbucket = 1, mincriterion = 0) |> plot()

That plots the following tree:

[plot of the resulting ctree]

That is a tree that uses up to three levels of splits and still does not achieve a perfect fit. The first split has a p-value of .2, indicating that there is nowhere near enough data to justify even that first split, let alone the ones below it. Such a tree is very likely to massively overfit the data, and overfitting is bad. That is why the usual tree algorithms come with measures to prevent overfitting, and in your case those measures prohibit growing a tree at all. I disabled them here with the extra arguments in the ctree call.
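
If you want to see how weak the association is without growing a tree at all, a plain independence test on one of the predictors tells a similar story. This is only a rough stand-in for the permutation tests that ctree runs internally, and picking JOGGED_YESTERDAY here is just for illustration:

# Cross-tabulate the target against a single predictor and test for association.
# With only 14 rows the expected cell counts are tiny, so Fisher's exact test is
# safer than a chi-squared test; the p-value is nowhere near conventional
# significance, which is why a sensibly configured tree refuses to split.
fisher.test(table(jogging$JOGGED_YESTERDAY, jogging$CLASSIFICATION))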

So in short: you do not have enough data. Just predicting "-" all the time is the most reasonable thing a classification tree can do here.
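
If you nevertheless want to see C5.0 itself attempt a split on this tiny data set, its stopping rules can be loosened through C5.0Control. This is only a sketch: minCases = 1 and noGlobalPruning = TRUE deliberately switch off the safeguards discussed above, so whatever tree comes out will be overfit and should not be trusted.

library(C50)
# Allow leaves containing a single case and skip global pruning -- this removes
# the protections that stopped the tree from being grown in the first place.
forced_model <- C5.0(x = jogging[, c("WEATHER", "JOGGED_YESTERDAY")],
                     y = jogging$CLASSIFICATION,
                     control = C5.0Control(minCases = 1, noGlobalPruning = TRUE))
summary(forced_model)   # may now show splits, but they are not statistically supported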
