I have a general question about dplyr mutating joins-CodePudding

The documentation for the dplyr package for mutating joins states that for using "by" that "A character vector of variables to join by" is needed.

This however does not seem to be the case.

It seems to work with a numeric or a dbl variable, i.e. column. By this I mean the common column (idenifier) that is used to join the dataframes.

This a glimpse of a portion of the two dataframes. nomem_encr is the common identifier.

df1
Rows: 2,525
Columns: 11
$ nomem_encr <dbl> 800054, 800170, 800186, 800204, 800228, 800274, 
$ mj16a093   <dbl lbl>  4,  4,  3,  4,  2,  5,  5,  3,  5,  3,  6,  
$ mj16a094   <dbl lbl>  3,  4,  2,  4,  2,  5,  5,  2,  3,  2,  6, 

df2
Rows: 6,092
Columns: 3
$ nomem_encr <dbl> 800009, 800015, 800042, 800054, 800057, 800085, 
$ cv16h101   <dbl lbl>   2,   3,   5,   6,   7,   6,   0,   6,   5,   
$ cv16h044   <dbl lbl>  6,  7,  7,  8,  0,  7,  5,  4,  7,  8,  7,  

df3 <- left_join(df1, df2, by = "nomem_encr")

Is it best to convert to a character, or does it not matter? My assumption was that the values just needed to be unique identifiers.

CodePudding user response：

"by" : "A character vector of variables to join by"`

This means that when you write the join, your by argument needs to be a character vector and NOT your column to be a character. for example

left_join(d1, d2, by = c('ID', 'Latitude'))

You are passing a character vector to the by argument. The columns being numeric does not matter

CodePudding user response：

Using Left Join

The wording could probably be better, but what is meant by the help page is that you have to supply a character vector for the name of the variable, but the variable type can be anything. Basically you are telling the by argument "join by this variable explicitly". A visual from R for Data Science by Hadley Wickham may help illustrate how this is accomplished with left_join, where the "key" is the by column necessary for joining:

An easy example can be found with the nycflights13 datasets. First you can load the tidyverse and nycflights13 packages.

#### Load Library ####
library(tidyverse)
library(nycflights13)

Then you can check which columns are shared by the flights dataset and the planes dataset.

#### Check Joinable Columns ####
colnames(flights)
colnames(planes) # share tail number

We find they both contain "tailnum":

> colnames(flights)
 [1] "year"           "month"          "day"            "dep_time"      
 [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
 [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
[13] "origin"         "dest"           "air_time"       "distance"      
[17] "hour"           "minute"         "time_hour"     
> colnames(planes) # share tail number
[1] "tailnum"      "year"         "type"         "manufacturer" "model"       
[6] "engines"      "seats"        "speed"        "engine"

We can combine both datasets with by via a left_join with the tailnum variable like so:

#### Join Two Dataframes ####
flight2 <- flights %>% 
  left_join(planes,
            by = "tailnum")
flight2

Which gives us our combined data:

# A tibble: 336,776 × 27
   year.x month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1   2013     1     1      517        515       2     830     819      11 UA     
 2   2013     1     1      533        529       4     850     830      20 UA     
 3   2013     1     1      542        540       2     923     850      33 AA     
 4   2013     1     1      544        545      -1    1004    1022     -18 B6     
 5   2013     1     1      554        600      -6     812     837     -25 DL     
 6   2013     1     1      554        558      -4     740     728      12 UA     
 7   2013     1     1      555        600      -5     913     854      19 B6     
 8   2013     1     1      557        600      -3     709     723     -14 EV     
 9   2013     1     1      557        600      -3     838     846      -8 B6     
10   2013     1     1      558        600      -2     753     745       8 AA     
# … with 336,766 more rows, 17 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, year.y <int>, type <chr>,
#   manufacturer <chr>, model <chr>, engines <int>, seats <int>, speed <int>,
#   engine <chr>, and abbreviated variable names ¹sched_dep_time, ²dep_delay,
#   ³arr_time, ⁴sched_arr_time, ⁵arr_delay

Generalizability of Other Joins

This is accomplished by all other joins in dplyr by the way, but how it does this is different depending on the join you use. The left_join function includes all rows in x (the first dataframe used, in this case flights) whereas a right_join joins by y (the second dataframe planes). Here are the three other examples, which I have labeled by their usage:

#### Alternative Joins ####
flights %>% 
  right_join(planes,
             by = "tailnum") # includes all rows in y

flights %>% 
  inner_join(planes,
             by = "tailnum") # includes all rows in x AND y

flights %>% 
  full_join(planes,
            by = "tailnum") # includes all rows in x OR y

Play around with them and see which is the best for your use case.