Subset of a large data set including 2000 variables - Removing variables when just one of their level

Time:09-06

I have a very large data set with 2000 variables, the majority of which are factors. Some variables have more than one level, but only one of their levels has a non-zero frequency. For example, Agegroup: Group1 (Freq=200), Group2 (Freq=0), Group3 (Freq=0). I am looking for a loop or function that checks the frequencies of all variables and removes every variable in which only one level has a non-zero frequency. The following code works for just one variable, but how can I check all variables in the data set?

library(dplyr)
df1 %>%
  group_by(ID) %>%
  filter(n() > cutoff)

CodePudding user response:

To get rid of the levels that are empty, you can use droplevels.

I added comments in the code. If you have any questions, let me know.

library(dplyr)

df <- data.frame(x = factor(rep("x", 10), levels = c("x", "y")),
                 z = factor(c(rep("m", 5), rep("p", 5))))

summary(df)

df <- df %>% droplevels()   # drop the empty levels once, before checking

# names of the columns with less than 2 unique values
remr <- names(df)[vapply(df, function(col) length(unique(col)) < 2, logical(1))]

# remove them all at the end, so the column indices don't change
df1 <- df %>% select(-all_of(remr))
summary(df1) # inspect
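
The same idea can also be sketched in base R by counting, for each factor, how many levels have a non-zero frequency, which is closer to how the question is worded. This is only a minimal sketch under assumptions: the helper name one_level_only, the vector drop_cols and the extra numeric column n are made up for illustration, and non-factor columns are simply left alone.

```r
# Toy data (hypothetical): `x` has an empty level, `z` has two populated
# levels, and `n` is numeric, so the factor check skips it.
df <- data.frame(
  x = factor(rep("x", 10), levels = c("x", "y")),
  z = factor(c(rep("m", 5), rep("p", 5))),
  n = 1:10
)

# TRUE when a factor column has at most one level with a non-zero count.
# table() counts every level of a factor, including the empty ones.
one_level_only <- function(col) {
  is.factor(col) && sum(table(col) > 0) <= 1
}

# Names of the columns to remove.
drop_cols <- names(df)[vapply(df, one_level_only, logical(1))]

# Keep the remaining columns and drop their unused levels.
df1 <- droplevels(df[, setdiff(names(df), drop_cols), drop = FALSE])

print(drop_cols)
str(df1)
```

Because vapply() visits each column once, this should scale to a few thousand variables without an explicit loop. Note that table() ignores NA by default, so a column with one populated level plus missing values is also treated as having a single level.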