Let's say I'm filtering a tibble, doing some processing, filtering and then doing some more processing, but I want to keep the complement of data that I filtered out at each step.
E.g. instead of
library(tidyverse)
data(mtcars)
mtcars %>%
filter(cyl<5) %>%
filter(gear>3 & wt>3) %>%
filter(mpg>23)
I might want something like this without doing separate steps where I store a tibble, do another step to store the complement and then do another step etc.
library(tidyverse)
data(mtcars)
mtcars1 = mtcars %>%
filter(cyl<5, keep_complement="mtcars2") %>%
filter(gear>3 & wt>3, keep_complement="mtcars3") %>%
filter(mpg>23, keep_complement="mtcars4")
# Desired outcome: 4 tibbles mtcars1 to 4
mtcars_final = bind_rows(mtcars1, mtcars2, mtcars3, mtcars4)
In case you wonder why I want this: I have some increasingly complex string operations to solve a problem, first simple direct comparison, then some regex/fuzzy string matching, then something for which I may end up having to use a neural network. It feels like there should be some neat way of doing the expensive operations only on a subset without having to write the code in so many steps.
I.e. what I'm trying to avoid is something really clumsy looking like this (which also requires me to invert any filtering operations myself - in practice also having to consider NA values etc.):
mtcars_tmp <- mtcars %>%
filter(cyl<5)
mtcars2 <- mtcars %>%
filter(cyl>=5)
mtcars_tmp2 <- mtcars_tmp %>%
filter(gear>3 & wt>3)
mtcars3 <- mtcars_tmp %>%
filter(gear<=3 | wt<=3)
mtcars1 <- mtcars_tmp2 %>%
filter(mpg>23)
mtcars4 <- mtcars_tmp2 %>%
filter(mpg<=23)
mtcars_final = bind_rows(mtcars1, mtcars2, mtcars3, mtcars4)
CodePudding user response:
This will create the complement using anti_join
and assigns it to a new object and returns the filtered results so it can be used just as a normal dplyr::filter
in the pipe:
library(tidyverse)
#' Filters a data.frame and saves the complement
#' @param keep_complement charachter to name the object the complement is saved to. NULL to not save it.
filter_complement <- function(.data, ..., keep_complement = NULL) {
res <- dplyr::filter(.data = .data, ...)
if(! is.null(keep_complement)) {
complement <- dplyr::anti_join(.data, res)
assign(keep_complement, complement, envir = globalenv())
}
res
}
mtcars %>%
filter(cyl < 5) %>%
filter(gear > 3 & wt > 3) %>%
filter(mpg > 23)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Merc 240D 24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
mtcars %>%
filter_complement(cyl < 5, keep_complement = "mtcars2") %>%
filter_complement(gear > 3 & wt > 3, keep_complement = "mtcars3") %>%
filter_complement(mpg > 23, keep_complement = "mtcars4")
#> Joining, by = c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am",
#> "gear", "carb")
#> Joining, by = c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am",
#> "gear", "carb")
#> Joining, by = c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am",
#> "gear", "carb")
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Merc 240D 24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
mtcars4
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Merc 230 22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
mtcars3
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
#> Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#> Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
#> Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
#> Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
#> Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
#> Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
#> Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
#> Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Created on 2022-05-06 by the reprex package (v2.0.0)