R code to calculate the entropy for each level in a categorical variable

Time:12-13

I have quite a few categorical variables in my dataset, each with more than two levels. I want an R function (or loop) that calculates the entropy and information gain for each level of each categorical variable and returns the lowest entropy and the highest information gain.

data <- list(
  buys = c("no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes", "yes", "yes", "yes", "yes", "no"),
  credit = c("fair", "excellent", "fair", "fair", "fair", "excellent", "excellent", "fair", "fair", "fair", "excellent", "excellent", "fair", "excellent"),
  student = c("no", "no", "no", "no", "yes", "yes", "yes", "no", "yes", "yes", "yes", "no", "yes", "no"),
  income = c("high", "high", "high", "medium", "low", "low", "low", "medium", "low", "medium", "medium", "medium", "high", "medium"),
  age = c(25, 27, 35, 41, 48, 42, 36, 29, 26, 45, 23, 33, 37, 44)
)
data <- as.data.frame(data)

Above is a sample data frame.

entropy_tab <- function(x) {
  tabfun2 <- prop.table(table(data[,x], training_credit_Risk[,13]) + 1e-6, margin = 1)
  sum(prop.table(table(data[,x])) * rowSums(-tabfun2 * log2(tabfun2)))
}

The function above calculates the entropy for each variable. I want a function that calculates each level's contribution to that entropy, i.e. the contribution of "excellent" and "fair" to the entropy of "credit".

CodePudding user response:

You have to modify your function to have two inputs, the variable you want and the level of the variable. Inside the function you then have to subset based on the level of the variable you want. I then use mapply to loop through the variable credit and each of its levels.

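entropy_tab <- function(x, y) {
  # data[,5] is used as the target column; in the sample data above column 5 is `age`,
  # so point this at the class column if a different target is intended
  tabfun2 <- prop.table(table(data[,x][data[,x] == y], data[,5][data[,x] == y]) + 1e-6, margin = 1)
  sum(prop.table(table(data[,x][data[,x] == y])) * rowSums(-tabfun2 * log2(tabfun2)))
}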


x <- mapply(entropy_tab, c("credit","credit"), unique(data$credit))

names(x) <- unique(data$credit)

#checks
entropy_tab("credit","excellent")
entropy_tab("credit","fair")
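
If the goal is each level's contribution to the conditional entropy of a class column, one option is to weight the entropy inside each level by that level's share of the rows. The following is only a minimal sketch under two assumptions that the question does not confirm: that `buys` is the class column, and that entropy is measured in bits. The helper name `level_contribution` is made up for illustration.

```r
# Two columns copied from the sample data in the question
buys <- c("no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes", "yes", "yes", "yes", "yes", "no")
credit <- c("fair", "excellent", "fair", "fair", "fair", "excellent", "excellent", "fair", "fair", "fair", "excellent", "excellent", "fair", "excellent")

# Hypothetical helper: contribution of each level of `x` to H(target | x), in bits
level_contribution <- function(x, target) {
  joint <- table(x, target)               # counts per level and class
  p <- prop.table(joint, margin = 1)      # P(class | level)
  terms <- -p * log2(p)
  terms[p == 0] <- 0                      # treat 0 * log2(0) as 0
  h_within <- rowSums(terms)              # class entropy inside each level
  weight <- rowSums(joint) / sum(joint)   # P(level)
  weight * h_within
}

contrib <- level_contribution(credit, buys)
contrib        # one value per level of credit
sum(contrib)   # conditional entropy H(buys | credit)
```

Because the weights are the level frequencies, the per-level values add up to the conditional entropy of the class given the variable, which is the quantity the question's original one-argument function computes (up to its `1e-6` smoothing term).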

CodePudding user response:

In measure theory, the expected surprisal of an event A in a probability space with measure mu is

-mu(A)log(mu(A))

The entropy is then the sum of the expected surprisal over all events, which here are the levels of a variable. So what you're looking for is the expected surprisal of each level of each variable.
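
As a worked example with natural logarithms: the `credit` column of the sample data has 6 "excellent" and 8 "fair" rows out of 14, so the two expected surprisals and their sum are

```latex
\begin{aligned}
-\tfrac{6}{14}\ln\tfrac{6}{14} &\approx 0.3631 \\
-\tfrac{8}{14}\ln\tfrac{8}{14} &\approx 0.3198 \\
H(\text{credit}) &\approx 0.3631 + 0.3198 = 0.6829
\end{aligned}
```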

Note that you won't be able to express the surprisal of a data frame as a data frame, because each variable in the data frame has a different number of levels.

You can do

exp_surprisal <- function(x, base=exp(1)) {
  t <- table(x)
  freq <- t/sum(t)
  ifelse(freq==0, 0, -freq * log(freq, base))
}

And then

lapply(data, exp_surprisal)

gives

$buys
x
       no       yes 
0.3677212 0.2840353 

$credit
x
excellent      fair 
0.3631277 0.3197805 

$student
x
       no       yes 
0.3465736 0.3465736 

$income
x
     high       low    medium 
0.3579323 0.3579323 0.3631277 

$age
x
       23        25        26        27        29        33        35        36        37        41        42        44        45        48 
0.1885041 0.1885041 0.1885041 0.1885041 0.1885041 0.1885041 0.1885041 0.1885041 0.1885041 0.1885041 0.1885041 0.1885041 0.1885041 0.1885041 

Note you can also define

entropy <- function(x) sum(exp_surprisal(x))

to get the entropy.

Then

lapply(data, entropy)

gives

$buys
[1] 0.6517566

$credit
[1] 0.6829081

$student
[1] 0.6931472

$income
[1] 1.078992

$age
[1] 2.639057
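
The question also asks for information gain, which the entropies above do not give directly. As a rough sketch, assuming `buys` is the class column and measuring in bits (base 2), the gain of a variable is the entropy of `buys` minus the conditional entropy of `buys` given that variable. The helper names `entropy_bits`, `cond_entropy` and `info_gain` are made up for illustration.

```r
# Categorical columns copied from the sample data in the question
data <- data.frame(
  buys = c("no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes", "yes", "yes", "yes", "yes", "no"),
  credit = c("fair", "excellent", "fair", "fair", "fair", "excellent", "excellent", "fair", "fair", "fair", "excellent", "excellent", "fair", "excellent"),
  student = c("no", "no", "no", "no", "yes", "yes", "yes", "no", "yes", "yes", "yes", "no", "yes", "no"),
  income = c("high", "high", "high", "medium", "low", "low", "low", "medium", "low", "medium", "medium", "medium", "high", "medium")
)

# Shannon entropy of a vector, in bits
entropy_bits <- function(x) {
  p <- prop.table(table(x))
  p <- p[p > 0]                 # drop empty levels so log2(0) never appears
  -sum(p * log2(p))
}

# H(target | x): class entropy inside each level, weighted by the level's share of rows
cond_entropy <- function(x, target) {
  groups <- split(target, x)
  sum(sapply(groups, function(g) length(g) / length(target) * entropy_bits(g)))
}

info_gain <- function(x, target) entropy_bits(target) - cond_entropy(x, target)

predictors <- c("credit", "student", "income")
gains <- sapply(data[predictors], info_gain, target = data$buys)
gains
names(which.max(gains))   # variable with the highest information gain
```

`age` is left out here because every value in the sample is unique, so its gain would be trivially maximal; a numeric column like that would need to be binned first.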