I want to automatically add a new dataset identifier variable when using full_join() in R.

df1 <- tribble(~ID, ~x,
"A", 1,
"B", 2,
"C", 3)

df2 <- tribble(~ID, ~y,
"D", 4,
"E", 5,
"F", 6)

combined <- df1 %>% dplyr::full_join(df2)

I know from ?full_join that it joins all rows from df1 followed by df2. But, I couldn't find an option to create an index variable automatically.

Currently, I'm adding an extra variable in df1 first

df1 <- tribble(~ID, ~x, ~dataset,
    "A", 1, 1,
    "B", 2, 1,
    "C", 3, 1)

and following it up with df1 %>% dplyr::full_join(df2) %>% dplyr::mutate(dataset = replace_na(dataset, 2))

Any suggestions to do it in a better way?

CodePudding user response：

I'm not sure if it's more efficient than yours', but if there always do not exist overlapping columns except id, then you may try

df1 %>%
  full_join(df2) %>%
  mutate(dataset = as.numeric(is.na(x)) 1)

  ID        x     y dataset
  <chr> <dbl> <dbl>   <dbl>
1 A         1    NA       1
2 B         2    NA       1
3 C         3    NA       1
4 D        NA     4       2
5 E        NA     5       2
6 F        NA     6       2

But to be safe, it might be better just define it's index(?) thing beforehand.

df1 %>%
  mutate(dataset = 1) %>%
  full_join(df2 %>% mutate(dataset = 2))

  ID        x     y dataset
  <chr> <dbl> <dbl>   <dbl>
1 A         1    NA       1
2 B         2    NA       1
3 C         3    NA       1
4 D        NA     4       2
5 E        NA     5       2
6 F        NA     6       2

New data

df1 <- tribble(~ID, ~x,~y,
               "A", 1,1,
               "B", 2,1,
               "C", 3,1)

df2 <- tribble(~ID, ~x,~y,
               "D", 4,1,
               "E", 5,1,
               "F", 6,1)

full_join(df1, df2)

  ID        x     y
  <chr> <dbl> <dbl>
1 A         1     1
2 B         2     1
3 C         3     1
4 D         4     1
5 E         5     1
6 F         6     1

CodePudding user response：

Instead of a "join", maybe try bind_rows from dplyr:

library(dplyr)

bind_rows(df1, df2, .id = "dataset")

This will bind rows, and the missing columns are filled in with NA. In addition, you can specify an ".id" argument with an identifier. If you provide a list of dataframes, the labels are taken from names in the list. If not, a numeric sequence is used (as seen below).

Output

  dataset ID        x     y
  <chr>   <chr> <dbl> <dbl>
1 1       A         1    NA
2 1       B         2    NA
3 1       C         3    NA
4 2       D        NA     4
5 2       E        NA     5
6 2       F        NA     6