Add new value to table() in order to be able to use chi square test

Time:01-08

From a single dataset I created two datasets by filtering on the target variable. Now I'd like to compare every feature across the two datasets using a chi-square test. The problem is that one of the two datasets is much smaller than the other, so for some features it is missing values that appear in the larger one. When I try to apply the chi-square test I get this error: "all arguments must have the same length".

How can I add the missing values to the smaller dataset's table so that I can use the chi-square test?

Example:

I want to use the chi-square test on the same feature in the two datasets:

chisq.test(table(df1$var1, df2$var1))

but I get the error "all arguments must have the same length" because table(df1$var1) is:

a  b  c  d
2  5  7  18

while table(df2$var1) is:

a  b  c
8  1  12

So what I would like to do is add the value d to the table for df2 with a count of 0, in order to be able to use the chi-square test.

CodePudding user response:

The table output for df2 can be made to include the missing value if we convert the column to a factor with the full set of levels specified:

table(factor(df2$var1, levels = letters[1:4]))

 a  b  c  d 
 8  1 12  0 
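The levels do not have to be hard-coded. A minimal sketch, assuming the two data frames hold the counts shown in the question (they are rebuilt here with rep() so the snippet runs on its own), that takes the levels from the union of the values in both columns:

```r
# Rebuild the example data from the counts in the question
df1 <- data.frame(var1 = rep(c("a", "b", "c", "d"), c(2, 5, 7, 18)))
df2 <- data.frame(var1 = rep(c("a", "b", "c"), c(8, 1, 12)))

# All values that occur in either dataset, in sorted order
lv <- sort(union(df1$var1, df2$var1))

# Tabulate df2 against the shared levels; absent values get a count of 0
t2 <- table(factor(df2$var1, levels = lv))
t2
```

The name lv is only illustrative. The same call with df1$var1 gives a table with identical columns, so the two can be compared position by position.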

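However, table() with two arguments cross-tabulates them element by element, so both vectors must have the same length. Here df1$var1 has 32 elements and df2$var1 has 21, which is what triggers the error. Instead, we can bind the two datasets and tabulate the combined data by dataset: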

library(dplyr)
table(bind_rows(df1, df2, .id = 'grp'))
   var1
grp  a  b  c  d
  1  2  5  7 18
  2  8  1 12  0

Or in base R, without extra packages:

table(data.frame(col1 = rep(1:2, c(nrow(df1), nrow(df2))), 
  col2 = c(df1$var1, df2$var1)))
    col2
col1  a  b  c  d
   1  2  5  7 18
   2  8  1 12  0
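The combined two-way table can then be passed to chisq.test() directly. A minimal sketch, again rebuilding the example data from the counts in the question; the names tab and res are only illustrative:

```r
# Rebuild the example data from the counts in the question
df1 <- data.frame(var1 = rep(c("a", "b", "c", "d"), c(2, 5, 7, 18)))
df2 <- data.frame(var1 = rep(c("a", "b", "c"), c(8, 1, 12)))

# Rows = source dataset, columns = value of var1
tab <- table(
  grp  = rep(c("df1", "df2"), c(nrow(df1), nrow(df2))),
  var1 = c(df1$var1, df2$var1)
)

# Chi-square test of independence between dataset and var1.
# With counts this small R may warn that the approximation may be incorrect.
res <- chisq.test(tab)
res

# Expected counts under independence, useful for judging that warning
res$expected
```

If some expected counts are small, chisq.test(tab, simulate.p.value = TRUE) or fisher.test(tab) may be more appropriate than the default approximation.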

data

df1 <- structure(list(var1 = c("a", "a", "b", "b", "b", "b", "b", "c", 
"c", "c", "c", "c", "c", "c", "d", "d", "d", "d", "d", "d", "d", 
"d", "d", "d", "d", "d", "d", "d", "d", "d", "d", "d")), class = "data.frame", 
row.names = c(NA, 
-32L))

df2 <- structure(list(var1 = c("a", "a", "a", "a", "a", "a", "a",
 "a", 
"b", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c"
)), class = "data.frame", row.names = c(NA, -21L))