I have these two dataframes for example:
dates = c('2020-11-19', '2020-11-20', '2020-11-21')
df1 <- data.frame(dates, area = c('paris', 'london', 'newyork'),
rating = c(10, 5, 6),
rating2 = c(5, 6, 7))
df2 <- data.frame(dates, area = c('budapest', 'moscow', 'valencia'),
rating = c(1, 2, 1))
> df1
dates area rating rating2
1 2020-11-19 paris 10 5
2 2020-11-20 london 5 6
3 2020-11-21 newyork 6 7
> df2
dates area rating
1 2020-11-19 budapest 1
2 2020-11-20 moscow 2
3 2020-11-21 valencia 1
When performing an outer join using dplyr:
df <- df1 %>%
full_join(df2, by = c('dates', 'area'))
the result is like this:
dates area rating.x rating2 rating.y
1 2020-11-19 paris 10 5 NA
2 2020-11-20 london 5 6 NA
3 2020-11-21 newyork 6 7 NA
4 2020-11-19 budapest NA NA 1
5 2020-11-20 moscow NA NA 2
6 2020-11-21 valencia NA NA 1
i.e. the rating columns from the two dataframes are not blended together but two separate columns are created.
How do I get a result like this?
dates area rating rating2
1 2020-11-19 paris 10 5
2 2020-11-20 london 5 6
3 2020-11-21 newyork 6 7
4 2020-11-19 budapest 1 NA
5 2020-11-20 moscow 2 NA
6 2020-11-21 valencia 1 NA
Thanks to the solutions provided by @kybazzi, the desired result was obtained.
df <- df1 %>%
bind_rows(df2)
FOLLOW-UP
As a follow-up question, I would like to join the following to the joined dataframe:
df3 <- data.frame(dates, area = c('budapest', 'moscow', 'valencia'),
rating2 = c(3, 2, 5))
Using the same method, the result is this:
> df_final <- df %>%
bind_rows(df3)
> df_final
dates area rating rating2
1 2020-11-19 paris 10 5
2 2020-11-20 london 5 6
3 2020-11-21 newyork 6 7
4 2020-11-19 budapest 1 NA
5 2020-11-20 moscow 2 NA
6 2020-11-21 valencia 1 NA
7 2020-11-19 budapest NA 3
8 2020-11-20 moscow NA 2
9 2020-11-21 valencia NA 5
How do i get a result like this:
dates area rating rating2
1 2020-11-19 paris 10 5
2 2020-11-20 london 5 6
3 2020-11-21 newyork 6 7
4 2020-11-19 budapest 1 3
5 2020-11-20 moscow 2 2
6 2020-11-21 valencia 1 5
CodePudding user response:
What you're looking for is dplyr::bind_rows()
, which will preserve common columns and fill NA
for columns that only exist in one of the data frames:
> bind_rows(df1, df2)
dates area rating rating2
1 2020-11-19 paris 10 5
2 2020-11-20 london 5 6
3 2020-11-21 newyork 6 7
4 2020-11-19 budapest 1 NA
5 2020-11-20 moscow 2 NA
6 2020-11-21 valencia 1 NA
Note that you could also continue using full_join()
- but you must ensure that all common columns between the data frames are included as keys if you don't want columns to be split:
> full_join(
df1, df2,
by = c("dates", "area", "rating")
)
dates area rating rating2
1 2020-11-19 paris 10 5
2 2020-11-20 london 5 6
3 2020-11-21 newyork 6 7
4 2020-11-19 budapest 1 NA
5 2020-11-20 moscow 2 NA
6 2020-11-21 valencia 1 NA
The documentation for dplyr joins mentions:
Output columns include all
x
columns and ally
columns. If columns inx
andy
have the same name (and aren't included inby
), suffixes are added to disambiguate.
You could also avoid this issue by not specifying by
, in which case dplyr will use all common columns.
> full_join(df1, df2)
Joining, by = c("dates", "area", "rating")
dates area rating rating2
1 2020-11-19 paris 10 5
2 2020-11-20 london 5 6
3 2020-11-21 newyork 6 7
4 2020-11-19 budapest 1 NA
5 2020-11-20 moscow 2 NA
6 2020-11-21 valencia 1 NA
As far as I know, both methods are good for your use case. In fact, I believe that the practical advantage full_join()
has over bind_rows()
is precisely this behaviour you wanted to avoid here, i.e. splitting columns that aren't keys.