I have two big datasets with over 3 million obs each and I used full_join to merge them together using the variable "N_AIH". Things is, in dataset 1 this variable is called "N_AIH" and in dataset 2 it's called "NUM_AIH". This is how I joined them:
join_test <- full_join(dataset1, dataset2, by = c("N_AIH" = "NUM_AIH"), keep = TRUE)
I have to keep both variables in the joined datasets,but now I need to identify:
1 - Obs that were in both datasets (the matches) 2 - Obs that were in dataset 1 but weren't in dataset2 3 - Obs that were in dataset 2 but weren't in dataset1
I can't seem to find a way to do it. I need to use the N_AIH/NUM_AIH variables.
CodePudding user response:
library(tidyverse)
dataset1 <- tibble(N_AIH = seq(5))
dataset1
#> # A tibble: 5 x 1
#> N_AIH
#> <int>
#> 1 1
#> 2 2
#> 3 3
#> 4 4
#> 5 5
dataset2 <- tibble(NUM_AIH = seq(3))
dataset2
#> # A tibble: 3 x 1
#> NUM_AIH
#> <int>
#> 1 1
#> 2 2
#> 3 3
joined <-
full_join(dataset1, dataset2, by = c("N_AIH" = "NUM_AIH"), keep = TRUE)
joined
#> # A tibble: 5 x 2
#> N_AIH NUM_AIH
#> <int> <int>
#> 1 1 1
#> 2 2 2
#> 3 3 3
#> 4 4 NA
#> 5 5 NA
# observations present in both datasets
inner_join(dataset1, dataset2, by = c("N_AIH" = "NUM_AIH"), keep = TRUE) %>%
distinct(N_AIH, .keep_all = TRUE)
#> # A tibble: 3 x 2
#> N_AIH NUM_AIH
#> <int> <int>
#> 1 1 1
#> 2 2 2
#> 3 3 3
# Obs that were in dataset 1 but weren't in dataset2
anti_join(dataset1, dataset2, by = c("N_AIH" = "NUM_AIH"))
#> # A tibble: 2 x 1
#> N_AIH
#> <int>
#> 1 4
#> 2 5
# Obs that were in dataset 2 but weren't in dataset1
anti_join(dataset2, dataset1, by = c("NUM_AIH" = "N_AIH"))
#> # A tibble: 0 x 1
#> # … with 1 variable: NUM_AIH <int>
Created on 2021-10-04 by the reprex package (v2.0.1)