I want to automatically add a new dataset identifier variable when using full_join()
in R.
df1 <- tribble(~ID, ~x,
"A", 1,
"B", 2,
"C", 3)
df2 <- tribble(~ID, ~y,
"D", 4,
"E", 5,
"F", 6)
combined <- df1 %>% dplyr::full_join(df2)
I know from ?full_join
that it joins all rows from df1 followed by df2. But, I couldn't find an option to create an index variable automatically.
Currently, I'm adding an extra variable in df1 first
df1 <- tribble(~ID, ~x, ~dataset,
"A", 1, 1,
"B", 2, 1,
"C", 3, 1)
and following it up with df1 %>% dplyr::full_join(df2) %>% dplyr::mutate(dataset = replace_na(dataset, 2))
Any suggestions to do it in a better way?
CodePudding user response:
I'm not sure if it's more efficient than yours', but if there always do not exist overlapping columns except id
, then you may try
df1 %>%
full_join(df2) %>%
mutate(dataset = as.numeric(is.na(x)) 1)
ID x y dataset
<chr> <dbl> <dbl> <dbl>
1 A 1 NA 1
2 B 2 NA 1
3 C 3 NA 1
4 D NA 4 2
5 E NA 5 2
6 F NA 6 2
But to be safe, it might be better just define it's index(?) thing beforehand.
df1 %>%
mutate(dataset = 1) %>%
full_join(df2 %>% mutate(dataset = 2))
ID x y dataset
<chr> <dbl> <dbl> <dbl>
1 A 1 NA 1
2 B 2 NA 1
3 C 3 NA 1
4 D NA 4 2
5 E NA 5 2
6 F NA 6 2
New data
df1 <- tribble(~ID, ~x,~y,
"A", 1,1,
"B", 2,1,
"C", 3,1)
df2 <- tribble(~ID, ~x,~y,
"D", 4,1,
"E", 5,1,
"F", 6,1)
full_join(df1, df2)
ID x y
<chr> <dbl> <dbl>
1 A 1 1
2 B 2 1
3 C 3 1
4 D 4 1
5 E 5 1
6 F 6 1
CodePudding user response:
Instead of a "join", maybe try bind_rows
from dplyr
:
library(dplyr)
bind_rows(df1, df2, .id = "dataset")
This will bind rows, and the missing columns are filled in with NA
. In addition, you can specify an ".id" argument with an identifier. If you provide a list of dataframes, the labels are taken from names in the list. If not, a numeric sequence is used (as seen below).
Output
dataset ID x y
<chr> <chr> <dbl> <dbl>
1 1 A 1 NA
2 1 B 2 NA
3 1 C 3 NA
4 2 D NA 4
5 2 E NA 5
6 2 F NA 6