Consider the data frame below:
df <- data.frame(a = c("Y", "Y", "N", "Y", "N", "N"),
                 b = c(200, 1, 1.4, 1.3, 2, 1.6),
                 c = c(200, -200, 10, 12, 14, 15),
                 d = c("f", "f", "m", "m", "m", "m"))
a b c d
1 Y 200.0 200 f
2 Y 1.0 -200 f
3 N 1.4 10 m
4 Y 1.3 12 m
5 N 2.0 14 m
6 N 1.6 15 m
I want to trim the data frame so that any row with a value below the 1st percentile or above the 99th percentile in any numeric column is removed. The expected result is:
a b c d
1 N 1.4 10 m
2 Y 1.3 12 m
3 N 2.0 14 m
4 N 1.6 15 m
I can remove the unwanted top and bottom values when no categorical variables are present:
df %>% dplyr::select(where(is.numeric)) %>%
  filter_all(all_vars(between(., quantile(., .01), quantile(., .99))))
But I do not know how to do this while keeping the categorical columns. Any help or hint is appreciated.
CodePudding user response:
We could use if_all in filter and select the numeric columns with where(is.numeric):
library(dplyr)
df %>%
  filter(if_all(where(is.numeric),
                ~ between(.x, quantile(.x, .01), quantile(.x, .99))))
-output
a b c d
1 N 1.4 10 m
2 Y 1.3 12 m
3 N 2.0 14 m
4 N 1.6 15 m
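For comparison, the same per-column trim can be sketched in base R without dplyr. This is only an illustration, not part of the answer above; the names num_cols, in_range and trimmed are made up here. It assumes the same inclusive bounds that between() uses.

```r
df <- data.frame(a = c("Y", "Y", "N", "Y", "N", "N"),
                 b = c(200, 1, 1.4, 1.3, 2, 1.6),
                 c = c(200, -200, 10, 12, 14, 15),
                 d = c("f", "f", "m", "m", "m", "m"))

# Which columns are numeric (hypothetical helper name)
num_cols <- vapply(df, is.numeric, logical(1))

# For each numeric column, flag the values that lie inside that
# column's own [1st, 99th] percentile range (bounds inclusive)
in_range <- vapply(df[num_cols], function(x) {
  lims <- quantile(x, c(0.01, 0.99), names = FALSE)
  x >= lims[1] & x <= lims[2]
}, logical(nrow(df)))

# Keep only the rows where every numeric column is in range;
# the character columns a and d are carried along untouched
trimmed <- df[rowSums(in_range) == ncol(in_range), ]
trimmed
```

Unlike the dplyr version, base subsetting keeps the original row names (3 to 6 here); rownames(trimmed) <- NULL resets them if that matters.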
CodePudding user response:
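Why do you need to check the data type? You can filter by row number instead, since a quantile is a rank-based cut-off. Note that this trims by row position, not by value, so it matches a value-based trim only if the rows are already sorted by the column of interest.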
df[findInterval(1:nrow(df), quantile(1:nrow(df),c(.01, 0.99)))==1,]
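A quick way to see what the positional filter selects is to look at the kept row indices on the example data. This is a small sketch; idx, cuts and keep are names introduced here for illustration only.

```r
df <- data.frame(a = c("Y", "Y", "N", "Y", "N", "N"),
                 b = c(200, 1, 1.4, 1.3, 2, 1.6),
                 c = c(200, -200, 10, 12, 14, 15),
                 d = c("f", "f", "m", "m", "m", "m"))

idx  <- seq_len(nrow(df))                          # row positions 1..6
cuts <- quantile(idx, c(0.01, 0.99), names = FALSE) # cut-offs on positions, not values

# findInterval() returns 0 below the lower cut, 1 between the cuts, 2 above the upper cut
keep <- findInterval(idx, cuts) == 1

which(keep)   # positions that survive the filter
df[keep, ]
```

On this unsorted data the filter drops the first and last rows by position, so the rows it keeps are not the ones with in-range values of b and c. Sorting by a single column first would make it trim on that column only.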