Home > Back-end >  dplyr: access full data frame (including added column) inside filter()
dplyr: access full data frame (including added column) inside filter()

Time:07-23

I'm sure there is a better way to ask this question, and perhaps if I knew it I'd have found an existing answer. The gist is this:

  • I'm reading in SQL data with R with NEW, UPDATE, and DELETE rows
  • table uniqueness is determined by values in two columns (call them a and b)
  • I want (a) find all combinations of a and b in which column c has a DELETE value, and (b) then remove all rows with one of these a and b combinations

Here's a reproducible example:

library(dplyr)

df <- data.frame(
  a = c(1, 1, 1, 2, 2),
  b = c(3, 3, 4, 5, 5),
  c = c("new", "delete", "new", "new", "update")
)

What I tried, not thinking this through:

df %>%
  mutate(filter_key = paste0(a, "_", b)) %>%
  filter(!(filter_key %in% (df %>%
                              filter(c == "delete") %>%
                              pull(filter_key) %>%
                              unique())))

The problem is that filter_key is not in df, so it's not there to pull unique values for. I'm curious if there is a way to refer to it "on the fly"? Some answers led me to think I could perhaps use . or .data, but this didn't work for me (replacing df with . or .data).

The trivial solution is to add it beforehand:

df <- df %>% mutate(filter_key = paste0(a, "_", b))
df %>%
  filter(!(filter_key %in% (df %>%
                              filter(c == "delete") %>%
                              pull(filter_key) %>%
                              unique())))

Alternatively (though this changes the question, so I'll leave it in these "notes"), there's probably a way to just filter directly, but I wasn't sure how to say:

  • give me all the combos of a and b where any row has c == "delete"
  • remove all rows where a and b match one of these combos

CodePudding user response:

You can do:

library(tidyverse)

df %>%
  mutate(both = str_c(a, b)) %>%
  filter(!both %in% (cur_data() %>%
                      filter(c == 'delete') %>%
                      pull(both)))

  a b      c both
1 1 4    new   14
2 2 5    new   25
3 2 5 update   25

CodePudding user response:

Use an anti-join:

df %>% anti_join(filter(., c == "delete"), by = c("a", "b"))

giving

  a b      c
1 1 4    new
2 2 5    new
3 2 5 update
  • Related