I have a 1000 x 20 data frame named newdf. The goal is to go through every column and compute that column's 2.5% and 97.5% quantiles. After that, if any feature in a row has a value outside its column's interval, we remove the ENTIRE row, regardless of whether the other features in that row are within their intervals.
So far, I have been able to write a for loop that stores all of the intervals in a list, like so:
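new_list <- vector("list", 20)  # initialise the list before filling it
for (i in 1:20) {
  # 2.5% and 97.5% quantiles of column i
  quant <- quantile(newdf[, i], c(.025, .975))
  new_list[[i]] <- quant
}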
I need help finding a way to apply these intervals across the 20 columns and then remove the rows.
I have been trying with if() and else statements, with no success.
CodePudding user response:
library(purrr)
idx_to_remove <- map_dfc(df, function(x) {
# for each column get interval
interval <- quantile(x, c(0.025, 0.975))
# generate boolean whether cell within interval
!(x >= interval[1] & x <= interval[2])
}) %>%
# for each row see if any TRUE
apply(1, any)
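# remove these rows (idx_to_remove is a logical vector, so negate it with !
# rather than using a minus sign, which only works for integer positions)
df[!idx_to_remove, ]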
Input data used:
set.seed(123)
df <- as.data.frame(matrix(rnorm(20 * 100), ncol = 20))
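For comparison, the same idea can probably be written in base R without purrr. This is only a sketch, assuming the simulated df above (100 rows, 20 numeric columns); the names outside, drop_row and trimmed are made up for illustration.

```r
# Assumption: same simulated data as above (100 rows x 20 numeric columns)
set.seed(123)
df <- as.data.frame(matrix(rnorm(20 * 100), ncol = 20))

# For each column, flag values outside that column's 2.5%-97.5% quantile range.
# sapply() returns a logical matrix with one column per column of df.
outside <- sapply(df, function(x) {
  bounds <- quantile(x, c(0.025, 0.975))
  x < bounds[1] | x > bounds[2]
})

# A row is dropped if at least one of its columns is flagged
drop_row <- rowSums(outside) > 0

# Keep only the rows where every column is inside its interval
trimmed <- df[!drop_row, ]
```

Note that the quantiles are computed once on the full data, so the surviving rows are within the original bounds, not within bounds recomputed after trimming.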
CodePudding user response:
If I understand what you are trying to do, you can do something like this:
d %>% filter(
d %>%
mutate(across(v1:v20, ~between(.x, quantile(.x,0.025), quantile(.x, 0.975)))) %>%
rowwise() %>%
summarize(keep = all(c_across(v1:v20)))
)
Here, I'm filtering d on a logical vector, which is created using mutate(across()), where first each of v1 through v20 becomes a logical vector (whether or not the value in that column is within that column's 0.025 to 0.975 quantile bounds), and then we summarize over the rows using rowwise() and c_across(). Ultimately, keep is a logical vector that is fed to the initial filter() call.
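A shorter variant of the same filter may be possible with if_all(), which is available in dplyr 1.0.4 and later. This is a sketch rather than a drop-in replacement: it assumes d has numeric columns named v1 through v20, and the example data and the name kept are invented here for illustration.

```r
library(dplyr)

# Assumption: d has 20 numeric columns named v1..v20 (hypothetical example data)
set.seed(123)
d <- as.data.frame(matrix(rnorm(20 * 100), ncol = 20))
names(d) <- paste0("v", 1:20)

# Keep a row only if every one of v1..v20 lies within that column's
# 2.5%-97.5% quantile range; the quantiles are computed on the full column
kept <- d %>%
  filter(if_all(v1:v20, ~ between(.x, quantile(.x, 0.025), quantile(.x, 0.975))))
```

This avoids the rowwise() step, since if_all() combines the per-column logical vectors with a logical AND.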