Home > database >  How to automate a series of anti_joins and create multiple dataframes for the outputs
How to automate a series of anti_joins and create multiple dataframes for the outputs

Time:10-14

I have a series of dataframes that I need to cross check against a master dataframe and return a dataframe for each check. It gets a bit tedious to have to write out the antijoin for each data frame.

e.g. I want to compare all data frames against "d1" and produce a data frame for each comparison.

d1 <- data.frame(su = c("c", "d", "c"))
d2 <- data.frame(su = c("c", "e", "d"))
d3 <- data.frame(su = c("c", "d", "f"))

I have already written a function to perform anti_joins...

perform_anti_join <- function(master_df, checking_df) {
  joined_dfs <- master_df %>%
    anti_join(checking_df, by = "su")
  return(joined_dfs)
}

I essentially want to create a loop using my above function that will take a series of dataframes (e.g. df2, df3) and compare it against the master df (df1) and each to produce a dataframe of the results. So in this example, I would have 2 dataframes one which checks df1 - df2 and one that checks df1 - df3.

Is anyone able to show me a way that I could automate this?

TIA

CodePudding user response:

The easiest way to perform the same action on multiple values is to put them in a list. Then you can use functions like purrr:map to iterate over that list (note that purrr is in the tidyverse so it may already be loaded). For example

library(purrr)
map(list(d2, d3), ~perform_anti_join(d1, .x))

will take d2 and d3 and pass that to perform_anti_join in the second parameter. You can put however many objects you like in the list.

You could also choose to wrap this in a function to make it easier to create the list and the master table

check_all <- function(master, ...) {
  map(list(...), ~perform_anti_join(master, .x))
}
check_all(d1, d2, d3)

Here we use ... to capture all the arguments after the first.

CodePudding user response:

I propose the following solution. First, collect the frames, or rather tibbles into one collective tibble like this:

library(tidyverse)

df = tibble(
  df = list(
    tibble(su = c("c", "d", "c")),
    tibble(su = c("c", "e", "d")),
    tibble(su = c("c", "d", "f")),
    tibble(su = c("a", "b", "c", "d")),
    tibble(su = c("b", "d", "f", "g"))
  ),
  id = 1:length(df)
)

Next is your perform_anti_join function but minimally simplified

perform_anti_join = function(master_df, checking_df) master_df %>%
    anti_join(checking_df, by = "su")

Now let's make a comparison to the fourth tibble

master_df = df$df[[4]]
dfperf = df %>% group_by(id) %>% 
  mutate(perform_anti_join = map(df, ~perform_anti_join(master_df, .x)))

output

# A tibble: 5 x 3
# Groups:   id [5]
  df                  id perform_anti_join
  <list>           <int> <list>           
1 <tibble [3 x 1]>     1 <tibble [2 x 1]> 
2 <tibble [3 x 1]>     2 <tibble [2 x 1]> 
3 <tibble [3 x 1]>     3 <tibble [2 x 1]> 
4 <tibble [4 x 1]>     4 <tibble [0 x 1]> 
5 <tibble [4 x 1]>     5 <tibble [2 x 1]> 

So let's see what we have here

dfperf$perform_anti_join[[1]]
# A tibble: 2 x 1
# su   
# <chr>
# 1 a    
# 2 b  

dfperf$perform_anti_join[[5]]
#  A tibble: 2 x 1
# su   
# <chr>
# 1 a    
# 2 c  

Hope you like it.

  • Related