How to identify which data belongs to which dataset after merging them in R?-CodePudding

I have two big datasets with over 3 million obs each and I used full_join to merge them together using the variable "N_AIH". Things is, in dataset 1 this variable is called "N_AIH" and in dataset 2 it's called "NUM_AIH". This is how I joined them:

join_test <- full_join(dataset1, dataset2, by = c("N_AIH" = "NUM_AIH"), keep = TRUE)

I have to keep both variables in the joined datasets,but now I need to identify:

1 - Obs that were in both datasets (the matches) 2 - Obs that were in dataset 1 but weren't in dataset2 3 - Obs that were in dataset 2 but weren't in dataset1

I can't seem to find a way to do it. I need to use the N_AIH/NUM_AIH variables.

CodePudding user response：

library(tidyverse)

dataset1 <- tibble(N_AIH = seq(5))
dataset1
#> # A tibble: 5 x 1
#>   N_AIH
#>   <int>
#> 1     1
#> 2     2
#> 3     3
#> 4     4
#> 5     5
dataset2 <- tibble(NUM_AIH = seq(3))
dataset2
#> # A tibble: 3 x 1
#>   NUM_AIH
#>     <int>
#> 1       1
#> 2       2
#> 3       3

joined <-
  full_join(dataset1, dataset2, by = c("N_AIH" = "NUM_AIH"), keep = TRUE)
joined
#> # A tibble: 5 x 2
#>   N_AIH NUM_AIH
#>   <int>   <int>
#> 1     1       1
#> 2     2       2
#> 3     3       3
#> 4     4      NA
#> 5     5      NA

# observations present in both datasets
inner_join(dataset1, dataset2, by = c("N_AIH" = "NUM_AIH"), keep = TRUE) %>%
  distinct(N_AIH, .keep_all = TRUE)
#> # A tibble: 3 x 2
#>   N_AIH NUM_AIH
#>   <int>   <int>
#> 1     1       1
#> 2     2       2
#> 3     3       3

# Obs that were in dataset 1 but weren't in dataset2 
anti_join(dataset1, dataset2, by = c("N_AIH" = "NUM_AIH"))
#> # A tibble: 2 x 1
#>   N_AIH
#>   <int>
#> 1     4
#> 2     5

# Obs that were in dataset 2 but weren't in dataset1
anti_join(dataset2, dataset1, by = c("NUM_AIH" = "N_AIH"))
#> # A tibble: 0 x 1
#> # … with 1 variable: NUM_AIH <int>

^{Created on 2021-10-04 by the reprex package (v2.0.1)}