Systematically applying multiple filters on data frame

Time:02-03

I want to apply multiple filters to a data frame one at a time and find out how many rows each filter removes.

Bonus points if it's possible to return how many rows would have been removed by filter n if none of the filters 1, ..., n-1 had been applied.

My current code as a starting point:

my_data <- data.frame(Criterion1 = c(1:9,1:3,3:7,2:4,6:1),
                      Criterion2 = c(9:1,3:1,2:3,1:9,5:3))

my_filters <- data.frame(Filter = c('Criterion1 != 2', 'Criterion2 != 4', 'Criterion1 != 9'),
                         Description = c('Remove all Rows with 2', 'Remove all Rows with 4', 'Remove all Rows with 9'))

lapply(my_filters, function(cur_filter) {
 my_data <- filter(my_data, cur_filter$Filter)
 cat(nrow(my_data), 'rows remaining after applying Filter "', cur_filter$Description, '"') # nicer: printing how many rows have been removed
})

(Rather than printing with cat on every iteration, I think it would be nicer to collect all the "removal information" in a table and print that table at the end.)

I'm open to storing my filters in a different way than I currently do.

CodePudding user response:

You could write a function that takes your data frame and your filter frame and returns a list containing the final data (after all filters have been applied) and, for each filter, the number of rows it removes when applied in isolation to the original data and when applied hierarchically, i.e. after all preceding filters:

library(dplyr)

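f <- function(df, filters) {
  original_df <- df  # untouched copy, used for the "isolated" counts
  for (i in seq_len(nrow(filters))) {
    # Parse the filter string (e.g. 'Criterion1 != 2') into an expression
    cur_filter <- rlang::parse_expr(as.character(filters[i, "Filter"]))
    # Rows this filter removes when applied on its own to the original data
    filters[i, "isolated_n_removed"] <- nrow(original_df) - nrow(filter(original_df, !!cur_filter))
    # Rows this filter removes after all preceding filters have been applied
    filtered_df <- filter(df, !!cur_filter)
    filters[i, "hierarchical_n_removed"] <- nrow(df) - nrow(filtered_df)
    df <- filtered_df
  }
  list(filtered_data = df, filter_effects = filters)
}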
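
To see what the two counters measure, here is a small hand check in base R for the second filter only. It is just a sketch on the example data from the question; the variable names `isolated` and `hierarchical` are made up for illustration.

```r
# Example data from the question
my_data <- data.frame(Criterion1 = c(1:9, 1:3, 3:7, 2:4, 6:1),
                      Criterion2 = c(9:1, 3:1, 2:3, 1:9, 5:3))

# Isolated: rows that 'Criterion2 != 4' removes from the original data
isolated <- sum(my_data$Criterion2 == 4)

# Hierarchical: rows it still removes once 'Criterion1 != 2' has been applied,
# i.e. rows with Criterion2 == 4 that survived the first filter
hierarchical <- sum(my_data$Criterion2 == 4 & my_data$Criterion1 != 2)

print(c(isolated = isolated, hierarchical = hierarchical))
```

The two numbers differ whenever an earlier filter has already removed some of the rows that a later filter targets.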

Apply the function to your inputs:

f(my_data, my_filters)

Output:

$filtered_data
   Criterion1 Criterion2
1           1          9
2           3          7
3           4          6
4           5          5
5           7          3
6           8          2
7           1          3
8           3          1
9           3          2
10          4          3
11          5          1
12          6          2
13          7          3
14          3          5
15          4          6
16          6          7
17          5          8
18          4          9
19          3          5
20          1          3

$filter_effects
           Filter            Description isolated_n_removed hierarchical_n_removed
1 Criterion1 != 2 Remove all Rows with 2                  4                      4
2 Criterion2 != 4 Remove all Rows with 4                  3                      1
3 Criterion1 != 9 Remove all Rows with 9                  1                      1
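
Since you are open to storing the filters differently: one possible alternative, sketched here in base R only, is a named list of unevaluated expressions instead of strings, so nothing has to be parsed. The names `my_filter_exprs` and `count_removed` are hypothetical, and this is an illustration under those assumptions rather than a replacement for the function above.

```r
# Example data from the question
my_data <- data.frame(Criterion1 = c(1:9, 1:3, 3:7, 2:4, 6:1),
                      Criterion2 = c(9:1, 3:1, 2:3, 1:9, 5:3))

# Hypothetical storage: description as the name, quoted condition as the value
my_filter_exprs <- list(
  "Remove all Rows with 2" = quote(Criterion1 != 2),
  "Remove all Rows with 4" = quote(Criterion2 != 4),
  "Remove all Rows with 9" = quote(Criterion1 != 9)
)

count_removed <- function(df, filter_exprs) {
  remaining <- df
  effects <- data.frame(Description = names(filter_exprs),
                        isolated_n_removed = NA_integer_,
                        hierarchical_n_removed = NA_integer_)
  for (i in seq_along(filter_exprs)) {
    # Evaluate the condition with the data frame columns in scope;
    # '%in% TRUE' treats NA as "drop the row"
    keep_isolated <- eval(filter_exprs[[i]], df) %in% TRUE
    keep_now <- eval(filter_exprs[[i]], remaining) %in% TRUE
    effects$isolated_n_removed[i] <- sum(!keep_isolated)
    effects$hierarchical_n_removed[i] <- sum(!keep_now)
    remaining <- remaining[keep_now, , drop = FALSE]
  }
  list(filtered_data = remaining, filter_effects = effects)
}

res <- count_removed(my_data, my_filter_exprs)
print(res$filter_effects)
```

Storing expressions avoids the string parsing step, at the cost that the filters can no longer live in a plain data frame column of character strings.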
Tags: r