Home > Back-end >  Remove columns with uniform values
Remove columns with uniform values

Time:01-27

I have dataframe with 100 columns. I want to filter those columns based on their nonuniformity. Ex. if there're columns with more than 90% (or 95% or 99%) NAs (or 0-s, or -999, or whatever value), remove them from dataframe.

I can remove with NAs, or 0-s, but the problem is I don't know what value it will be. Ex. of removing NAs with more than 90% df[, which(colMeans(!is.na(df)) > 0.9)]

CodePudding user response:

I would simply use table to count the number of occurence of each value, when the maximum of these values exceeds the needed threshold you can discard the column.

In the following toy example, x, y and z are "constant". For x there are 96% of NA values, for y there are 99% of 0 and for z there are 97% of -1 (but any value would work).

set.seed(26012023)
df <- data.frame(w = rnorm(100), x = c(rep(NA, 96), rnorm(4)), y = c(rep(0, 99), rnorm(1)),
                 z = c(rep(-1, 97), rnorm(3)))

apply(df, 2, function(x, cutoff = .95) {
   tab <- table(x, useNA = "ifany")
   max_val <- max(tab)
   max_val >= cutoff * length(x)
})

#     w     x     y     z 
# FALSE  TRUE  TRUE  TRUE 

CodePudding user response:

We can create a toy example, defining the following data.frame named df

# Seed to make it reproducible
  set.seed(12345)

  df <- data.frame(cbind(Var1 = c(rep(10,19),1),
                         Var2 = sample(letters[1:5],20, prob = c(0.8,0.1,0.5,0.25,0.25), replace = T),
                         Var3 = sample(c("Yes","No"), 20,  prob = c(.95, .05), replace = T),
                         Var4 = sample(1:3, 20, replace = T),
                         Var5 = c(rep(NA,15),rep(1,5))))

Then we compute the maximum frequency of a single value for each colum and finally we delet those that exceed the required value

# Calculate the maximum frequency for a single value for each column
  aux <- apply(df,2,function(x) max(prop.table(table(x, useNA = "ifany"))))     
# Define new.df as df whithout the columns that have a value more than a 90% of times
  new.df <- df[,-which(aux>.9)]
  • Related