Removing rows from a data frame that are outside an interval [R]

I have a 1000 x 20 data frame named newdf. The goal is to go through every column and compute its 2.5% and 97.5% quantiles. After that, if any feature in a row has a value outside its column's interval, we remove the ENTIRE row, regardless of whether the other features in that row are within their intervals.

So far, I have been able to create a for loop that stores all of the intervals into a list like so

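new_list <- vector("list", 20)  # must exist before new_list[[i]] is assigned
for (i in 1:20) {
  # 2.5% and 97.5% quantiles of column i
  quant <- quantile(newdf[, i], c(.025, .975))
  new_list[[i]] <- quant
}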

I need help finding a way to apply these intervals over the 20 columns to then remove the rows.

I have been trying with if/else statements, with no success.

CodePudding user response：

library(purrr)    
idx_to_remove <- map_dfc(df, function(x) {
        # for each column get interval
        interval <- quantile(x, c(0.025, 0.975))
    
        # generate boolean: TRUE where the cell is outside the interval
        !(x >= interval[1] & x <= interval[2])
    }) %>% 

    # for each row see if any TRUE
    apply(1, any)

# remove these rows (idx_to_remove is logical, so negate it with ! rather than -)
df[!idx_to_remove, ]

Input Data used

set.seed(123)
df <- as.data.frame(matrix(rnorm(20 * 100), ncol = 20))
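
For comparison, the same filter can be sketched in base R without purrr. This is only a minimal sketch on the simulated data above; the names bounds, outside, and df_clean are made up for illustration and do not come from the original answer.

```r
# Simulated stand-in data, same shape as the input data above
set.seed(123)
df <- as.data.frame(matrix(rnorm(20 * 100), ncol = 20))

# One row per column of df: its 2.5% and 97.5% quantiles
bounds <- t(sapply(df, quantile, probs = c(0.025, 0.975)))

# Logical matrix: TRUE where a cell falls outside its own column's interval
outside <- mapply(function(x, lo, hi) x < lo | x > hi,
                  df, bounds[, 1], bounds[, 2])

# Keep only the rows that have no out-of-interval cell
df_clean <- df[rowSums(outside) == 0, ]
nrow(df_clean)
```

Note that the quantiles are computed once on the full data, so every remaining value lies within the original interval of its column, not within an interval recomputed after filtering.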

CodePudding user response：

If I understand what you are trying to do, you can do something like this:


library(dplyr)

d %>% filter(
  d %>% 
  mutate(across(v1:v20, ~between(.x, quantile(.x,0.025), quantile(.x, 0.975)))) %>%
  rowwise() %>% 
  summarize(keep = all(c_across(v1:v20))) %>%
  pull(keep)  # extract keep as a plain logical vector for filter()
)

Here, I'm filtering d on a logical vector, which is created using mutate(across()): first, each of v1 through v20 becomes a logical vector (whether or not the value in that column is within that column's 0.025 to 0.975 quantile bounds), and then we summarize over the rows using rowwise() and c_across(). Ultimately, keep is a logical vector that is fed to the initial filter() call.
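
The "within bounds in every column" logic described above can probably also be written more compactly with if_all(). The following is a sketch, not the answer's original code: it assumes dplyr 1.0.4 or later (where if_all() is available) and uses made-up example data d with columns named v1 to v20.

```r
library(dplyr)

# Hypothetical example data: 100 rows, columns named v1..v20
set.seed(123)
d <- as.data.frame(matrix(rnorm(20 * 100), ncol = 20))
names(d) <- paste0("v", 1:20)

# Keep a row only if every column's value lies within that column's interval
d_clean <- d %>%
  filter(if_all(v1:v20,
                ~ between(.x, quantile(.x, 0.025), quantile(.x, 0.975))))

nrow(d_clean)
```

Because d is not grouped, each quantile() call inside filter() sees the whole column, so the bounds are per column, as in the rowwise() version.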