How to remove top and bottom percentile values when both categorical and numerical columns exist in

Time:11-08

Consider the data frame below:

df <- data.frame(a = c("Y", "Y", "N", "Y", "N", "N"),
                 b = c(200, 1, 1.4, 1.3, 2, 1.6),
                 c = c(200, -200, 10, 12, 14, 15),
                 d = c("f", "f", "m", "m", "m", "m"))
  a     b    c d
1 Y 200.0  200 f
2 Y   1.0 -200 f
3 N   1.4   10 m
4 Y   1.3   12 m
5 N   2.0   14 m
6 N   1.6   15 m

I want to trim the data frame so that rows with values below the 1st percentile or above the 99th percentile in any numeric column are removed. The expected result:

  a   b  c d
1 N 1.4 10 m
2 Y 1.3 12 m
3 N 2.0 14 m
4 N 1.6 15 m
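
To see why rows 1 and 2 drop out, it can help to look at the percentile bounds themselves. The sketch below assumes R's default quantile() method (type 7, linear interpolation); the names bounds_b and bounds_c are only illustrative.

```r
# Same data as in the question
df <- data.frame(a = c("Y", "Y", "N", "Y", "N", "N"),
                 b = c(200, 1, 1.4, 1.3, 2, 1.6),
                 c = c(200, -200, 10, 12, 14, 15),
                 d = c("f", "f", "m", "m", "m", "m"))

# 1st and 99th percentile of each numeric column (default type 7)
bounds_b <- quantile(df$b, c(0.01, 0.99))
bounds_c <- quantile(df$c, c(0.01, 0.99))

bounds_b  # 1% = 1.015, 99% = 190.1
bounds_c  # 1% = -189.5, 99% = 190.75
```

With only six rows, the minimum and maximum of each column fall outside these bounds, which is why rows 1 and 2 are removed.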

I can remove the undesired top and bottom values when no categorical variables are present:

library(dplyr)
# Passing a predicate straight to select() is deprecated; wrap it in where()
df %>% dplyr::select(where(is.numeric)) %>%
    filter_all(all_vars(between(., quantile(., .01), quantile(., .99))))

But I do not know how to do this while keeping the categorical columns. Any help or hint is appreciated.

CodePudding user response:

We could use if_all inside filter and select the numeric columns with where(is.numeric):

library(dplyr)
df %>%
   filter(if_all(where(is.numeric),
     ~ between(.x, quantile(.x, .01), quantile(.x, .99))))

-output

  a   b  c d
1 N 1.4 10 m
2 Y 1.3 12 m
3 N 2.0 14 m
4 N 1.6 15 m
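
For comparison, here is a minimal base R sketch of the same idea, for cases where dplyr is not available. The helper name trim_percentiles and its arguments are hypothetical and not from the thread. It also assumes missing values should be ignored when computing the bounds and that rows with a missing numeric value should be dropped.

```r
# Same data as in the question
df <- data.frame(a = c("Y", "Y", "N", "Y", "N", "N"),
                 b = c(200, 1, 1.4, 1.3, 2, 1.6),
                 c = c(200, -200, 10, 12, 14, 15),
                 d = c("f", "f", "m", "m", "m", "m"))

# Hypothetical helper: keep rows whose numeric values all lie within
# the [lower, upper] percentile bounds of their own column
trim_percentiles <- function(data, lower = 0.01, upper = 0.99) {
  num_cols <- names(data)[vapply(data, is.numeric, logical(1))]
  keep <- rep(TRUE, nrow(data))
  for (col in num_cols) {
    x <- data[[col]]
    # Assumption: NAs are ignored for the bounds, then dropped below
    bounds <- unname(quantile(x, c(lower, upper), na.rm = TRUE))
    keep <- keep & !is.na(x) & x >= bounds[1] & x <= bounds[2]
  }
  data[keep, , drop = FALSE]
}

trimmed <- trim_percentiles(df)
trimmed
```

Unlike the dplyr version, this keeps the original row names (3 to 6) rather than renumbering them.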

CodePudding user response:

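Why do you need to check the data type? You can filter by row number, since a quantile is rank-based. Note that this trims by row position rather than by value, so it drops the extreme values only if the rows are sorted by the column in question.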

df[findInterval(1:nrow(df), quantile(1:nrow(df),c(.01, 0.99)))==1,]
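
As a quick check of what the line above selects, here is a small sketch using the question's data. The variable names idx, pos and by_position are illustrative only.

```r
# Same data as in the question
df <- data.frame(a = c("Y", "Y", "N", "Y", "N", "N"),
                 b = c(200, 1, 1.4, 1.3, 2, 1.6),
                 c = c(200, -200, 10, 12, 14, 15),
                 d = c("f", "f", "m", "m", "m", "m"))

idx <- seq_len(nrow(df))

# The bounds are computed on row indices (1.05 and 5.95), not on values
pos <- findInterval(idx, quantile(idx, c(0.01, 0.99)))
pos  # 0 1 1 1 1 2

# Keeps rows 2 to 5 by position
by_position <- df[pos == 1, ]
by_position$c  # -200 10 12 14
```

On the unsorted data this keeps row 2 (c = -200) and drops row 6, so the result differs from the value-based filter.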