I have a series of dataframes that I need to cross check against a master dataframe and return a dataframe for each check. It gets a bit tedious to have to write out the antijoin for each data frame.
e.g. I want to compare all data frames against "d1" and produce a data frame for each comparison.
d1 <- data.frame(su = c("c", "d", "c"))
d2 <- data.frame(su = c("c", "e", "d"))
d3 <- data.frame(su = c("c", "d", "f"))
I have already written a function to perform anti_joins...
perform_anti_join <- function(master_df, checking_df) {
joined_dfs <- master_df %>%
anti_join(checking_df, by = "su")
return(joined_dfs)
}
I essentially want to create a loop using my above function that will take a series of dataframes (e.g. df2, df3) and compare it against the master df (df1) and each to produce a dataframe of the results. So in this example, I would have 2 dataframes one which checks df1 - df2 and one that checks df1 - df3.
Is anyone able to show me a way that I could automate this?
TIA
CodePudding user response:
The easiest way to perform the same action on multiple values is to put them in a list. Then you can use functions like purrr:map
to iterate over that list (note that purrr
is in the tidyverse so it may already be loaded). For example
library(purrr)
map(list(d2, d3), ~perform_anti_join(d1, .x))
will take d2
and d3
and pass that to perform_anti_join
in the second parameter. You can put however many objects you like in the list.
You could also choose to wrap this in a function to make it easier to create the list and the master table
check_all <- function(master, ...) {
map(list(...), ~perform_anti_join(master, .x))
}
check_all(d1, d2, d3)
Here we use ...
to capture all the arguments after the first.
CodePudding user response:
I propose the following solution. First, collect the frames, or rather tibble
s into one collective tibble
like this:
library(tidyverse)
df = tibble(
df = list(
tibble(su = c("c", "d", "c")),
tibble(su = c("c", "e", "d")),
tibble(su = c("c", "d", "f")),
tibble(su = c("a", "b", "c", "d")),
tibble(su = c("b", "d", "f", "g"))
),
id = 1:length(df)
)
Next is your perform_anti_join
function but minimally simplified
perform_anti_join = function(master_df, checking_df) master_df %>%
anti_join(checking_df, by = "su")
Now let's make a comparison to the fourth tibble
master_df = df$df[[4]]
dfperf = df %>% group_by(id) %>%
mutate(perform_anti_join = map(df, ~perform_anti_join(master_df, .x)))
output
# A tibble: 5 x 3
# Groups: id [5]
df id perform_anti_join
<list> <int> <list>
1 <tibble [3 x 1]> 1 <tibble [2 x 1]>
2 <tibble [3 x 1]> 2 <tibble [2 x 1]>
3 <tibble [3 x 1]> 3 <tibble [2 x 1]>
4 <tibble [4 x 1]> 4 <tibble [0 x 1]>
5 <tibble [4 x 1]> 5 <tibble [2 x 1]>
So let's see what we have here
dfperf$perform_anti_join[[1]]
# A tibble: 2 x 1
# su
# <chr>
# 1 a
# 2 b
dfperf$perform_anti_join[[5]]
# A tibble: 2 x 1
# su
# <chr>
# 1 a
# 2 c
Hope you like it.