Home > Software engineering >  R dplyr full_join - no common key, need common columns to blend together
R dplyr full_join - no common key, need common columns to blend together

Time:12-18

I have these two dataframes for example:

dates = c('2020-11-19', '2020-11-20', '2020-11-21')
df1 <- data.frame(dates, area = c('paris', 'london', 'newyork'), 
                  rating = c(10, 5, 6),
                  rating2 = c(5, 6, 7))

df2 <- data.frame(dates, area = c('budapest', 'moscow', 'valencia'), 
                  rating = c(1, 2, 1))
> df1
       dates    area rating rating2
1 2020-11-19   paris     10       5
2 2020-11-20  london      5       6
3 2020-11-21 newyork      6       7
> df2
       dates     area rating
1 2020-11-19 budapest      1
2 2020-11-20   moscow      2
3 2020-11-21 valencia      1

When performing an outer join using dplyr:

df <- df1 %>%
  full_join(df2, by = c('dates', 'area'))

the result is like this:

       dates     area rating.x rating2 rating.y
1 2020-11-19    paris       10       5       NA
2 2020-11-20   london        5       6       NA
3 2020-11-21  newyork        6       7       NA
4 2020-11-19 budapest       NA      NA        1
5 2020-11-20   moscow       NA      NA        2
6 2020-11-21 valencia       NA      NA        1

i.e. the rating columns from the two dataframes are not blended together but two separate columns are created.

How do I get a result like this?

       dates     area rating   rating2 
1 2020-11-19    paris       10       5       
2 2020-11-20   london        5       6       
3 2020-11-21  newyork        6       7       
4 2020-11-19 budapest        1      NA        
5 2020-11-20   moscow        2      NA        
6 2020-11-21 valencia        1      NA        

Thanks to the solutions provided by @kybazzi, the desired result was obtained.

df <- df1 %>%
  bind_rows(df2)

FOLLOW-UP

As a follow-up question, I would like to join the following to the joined dataframe:

df3 <- data.frame(dates, area = c('budapest', 'moscow', 'valencia'), 
                  rating2 = c(3, 2, 5))

Using the same method, the result is this:

> df_final <- df %>%
      bind_rows(df3)
> df_final
       dates     area rating rating2
1 2020-11-19    paris     10       5
2 2020-11-20   london      5       6
3 2020-11-21  newyork      6       7
4 2020-11-19 budapest      1      NA
5 2020-11-20   moscow      2      NA
6 2020-11-21 valencia      1      NA
7 2020-11-19 budapest     NA       3
8 2020-11-20   moscow     NA       2
9 2020-11-21 valencia     NA       5

How do i get a result like this:

       dates     area rating   rating2 
1 2020-11-19    paris       10       5       
2 2020-11-20   london        5       6       
3 2020-11-21  newyork        6       7       
4 2020-11-19 budapest        1       3        
5 2020-11-20   moscow        2       2        
6 2020-11-21 valencia        1       5

CodePudding user response:

What you're looking for is dplyr::bind_rows(), which will preserve common columns and fill NA for columns that only exist in one of the data frames:

> bind_rows(df1, df2)
       dates     area rating rating2
1 2020-11-19    paris     10       5
2 2020-11-20   london      5       6
3 2020-11-21  newyork      6       7
4 2020-11-19 budapest      1      NA
5 2020-11-20   moscow      2      NA
6 2020-11-21 valencia      1      NA

Note that you could also continue using full_join() - but you must ensure that all common columns between the data frames are included as keys if you don't want columns to be split:

> full_join(
    df1, df2,
    by = c("dates", "area", "rating")
  )
       dates     area rating rating2
1 2020-11-19    paris     10       5
2 2020-11-20   london      5       6
3 2020-11-21  newyork      6       7
4 2020-11-19 budapest      1      NA
5 2020-11-20   moscow      2      NA
6 2020-11-21 valencia      1      NA

The documentation for dplyr joins mentions:

Output columns include all x columns and all y columns. If columns in x and y have the same name (and aren't included in by), suffixes are added to disambiguate.

You could also avoid this issue by not specifying by, in which case dplyr will use all common columns.

> full_join(df1, df2)
Joining, by = c("dates", "area", "rating")
       dates     area rating rating2
1 2020-11-19    paris     10       5
2 2020-11-20   london      5       6
3 2020-11-21  newyork      6       7
4 2020-11-19 budapest      1      NA
5 2020-11-20   moscow      2      NA
6 2020-11-21 valencia      1      NA

As far as I know, both methods are good for your use case. In fact, I believe that the practical advantage full_join() has over bind_rows() is precisely this behaviour you wanted to avoid here, i.e. splitting columns that aren't keys.

  • Related