I have dataframe with 100 columns. I want to filter those columns based on their nonuniformity. Ex. if there're columns with more than 90% (or 95% or 99%) NAs (or 0-s, or -999, or whatever value), remove them from dataframe.
I can remove with NAs, or 0-s, but the problem is I don't know what value it will be.
Ex. of removing NAs with more than 90% df[, which(colMeans(!is.na(df)) > 0.9)]
CodePudding user response:
I would simply use table
to count the number of occurence of each value, when the maximum of these values exceeds the needed threshold you can discard the column.
In the following toy example, x
, y
and z
are "constant". For x
there are 96% of NA
values, for y
there are 99% of 0
and for z
there are 97% of -1
(but any value would work).
set.seed(26012023)
df <- data.frame(w = rnorm(100), x = c(rep(NA, 96), rnorm(4)), y = c(rep(0, 99), rnorm(1)),
z = c(rep(-1, 97), rnorm(3)))
apply(df, 2, function(x, cutoff = .95) {
tab <- table(x, useNA = "ifany")
max_val <- max(tab)
max_val >= cutoff * length(x)
})
# w x y z
# FALSE TRUE TRUE TRUE
CodePudding user response:
We can create a toy example, defining the following data.frame
named df
# Seed to make it reproducible
set.seed(12345)
df <- data.frame(cbind(Var1 = c(rep(10,19),1),
Var2 = sample(letters[1:5],20, prob = c(0.8,0.1,0.5,0.25,0.25), replace = T),
Var3 = sample(c("Yes","No"), 20, prob = c(.95, .05), replace = T),
Var4 = sample(1:3, 20, replace = T),
Var5 = c(rep(NA,15),rep(1,5))))
Then we compute the maximum frequency of a single value for each colum and finally we delet those that exceed the required value
# Calculate the maximum frequency for a single value for each column
aux <- apply(df,2,function(x) max(prop.table(table(x, useNA = "ifany"))))
# Define new.df as df whithout the columns that have a value more than a 90% of times
new.df <- df[,-which(aux>.9)]