I want to apply multiple filters to a data frame systematically, one at a time, and find out how many rows each filter removes.
Bonus points if it's also possible to report how many rows filter n would have removed if none of the filters 1, ..., n-1 had been applied.
My current code as a starting point:
my_data <- data.frame(Criterion1 = c(1:9,1:3,3:7,2:4,6:1),
                      Criterion2 = c(9:1,3:1,2:3,1:9,5:3))
my_filters <- data.frame(Filter = c('Criterion1 != 2', 'Criterion2 != 4', 'Criterion1 != 9'),
                         Description = c('Remove all Rows with 2', 'Remove all Rows with 4', 'Remove all Rows with 9'))
lapply(my_filters, function(cur_filter) {
  my_data <- filter(my_data, cur_filter$Filter)
  cat(nrow(my_data), 'rows remaining after applying Filter "', cur_filter$Description, '"') # nicer: printing how many rows have been removed
})
(Instead of printing with cat every time, I think it would be nicer to create a table, collect all the "removal information" there, and print the table at the end.)
I'm open to storing my filters differently than I do now.
CodePudding user response:
You could write a function that takes your data frame and your filter frame and returns a list containing the final data (after all filters have been applied) together with the effect of each filter, both when applied in isolation to the original data and when applied hierarchically, i.e. after all preceding filters:
library(dplyr)
f <- function(df, filters) {
  static_df <- df  # untouched copy, used to measure each filter in isolation
  for (i in seq_len(nrow(filters))) {
    # turn the filter string into an expression that dplyr can evaluate
    cond <- rlang::parse_expr(as.character(filters[i, "Filter"]))
    kept <- filter(df, !!cond)  # data remaining after filters 1, ..., i
    filters[i, "isolated_n_removed"] <- nrow(static_df) - nrow(filter(static_df, !!cond))
    filters[i, "hierarchical_n_removed"] <- nrow(df) - nrow(kept)
    df <- kept
  }
  list(filtered_data = df, filter_effects = filters)
}
Apply the function to your inputs:
f(my_data, my_filters)
Output:
$filtered_data
   Criterion1 Criterion2
1           1          9
2           3          7
3           4          6
4           5          5
5           7          3
6           8          2
7           1          3
8           3          1
9           3          2
10          4          3
11          5          1
12          6          2
13          7          3
14          3          5
15          4          6
16          6          7
17          5          8
18          4          9
19          3          5
20          1          3

$filter_effects
           Filter            Description isolated_n_removed hierarchical_n_removed
1 Criterion1 != 2 Remove all Rows with 2                  4                      4
2 Criterion2 != 4 Remove all Rows with 4                  3                      1
3 Criterion1 != 9 Remove all Rows with 9                  1                      1
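Since you mention being open to storing the filters differently, here is a rough sketch of one alternative: keep the filters as a named list of unevaluated expressions instead of strings, so nothing has to be parsed at run time. The names my_filter_exprs and count_removed are made up for this illustration, and using the list names as descriptions is just one possible convention.

```r
library(dplyr)

my_data <- data.frame(
  Criterion1 = c(1:9, 1:3, 3:7, 2:4, 6:1),
  Criterion2 = c(9:1, 3:1, 2:3, 1:9, 5:3)
)

# Hypothetical alternative storage: a named list of unevaluated expressions,
# where each name doubles as the filter's description.
my_filter_exprs <- rlang::exprs(
  "Remove all Rows with 2" = Criterion1 != 2,
  "Remove all Rows with 4" = Criterion2 != 4,
  "Remove all Rows with 9" = Criterion1 != 9
)

# Hypothetical helper: applies the filters in order and records, per filter,
# the rows removed in isolation and the rows removed after the previous filters.
count_removed <- function(df, filter_exprs) {
  remaining <- df
  effects <- data.frame(
    Description = names(filter_exprs),
    isolated_n_removed = NA_integer_,
    hierarchical_n_removed = NA_integer_
  )
  for (i in seq_along(filter_exprs)) {
    cond <- filter_exprs[[i]]
    kept <- filter(remaining, !!cond)  # inject the stored expression
    effects$isolated_n_removed[i] <- nrow(df) - nrow(filter(df, !!cond))
    effects$hierarchical_n_removed[i] <- nrow(remaining) - nrow(kept)
    remaining <- kept
  }
  list(filtered_data = remaining, filter_effects = effects)
}

res <- count_removed(my_data, my_filter_exprs)
res$filter_effects
```

A quick sanity check for either version: the hierarchical counts should add up to the total number of rows removed, i.e. sum(res$filter_effects$hierarchical_n_removed) should equal nrow(my_data) - nrow(res$filtered_data).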