Home > other >  How to use left_join without adding additional rows R?
How to use left_join without adding additional rows R?

Time:12-03

I have two tables. The first table has 206 206 rows and the second one 300. After using left_join this way:

final <- left_join(first, second, by="CATEGORY")

the final table has 223 201 rows. I don't understand why additional rows were added. I need to have 206 206 rows at the end, with only additional columns added. What can be the problem? I tried inner_join instead, I still have the same problem.

CodePudding user response:

Joining always adds additional rows if there are multiple ways for matching. Here, c can be both 1 and 2 if a is 1. Use distinct on the id columns (here: a) to just get one of that one-to-many match:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df1 <- tibble(a = c(1,2), b = 2)
df1
#> # A tibble: 2 x 2
#>       a     b
#>   <dbl> <dbl>
#> 1     1     2
#> 2     2     2
df2 <- tibble(a = c(1,1,2), c = c(1,2,3))
df2
#> # A tibble: 3 x 2
#>       a     c
#>   <dbl> <dbl>
#> 1     1     1
#> 2     1     2
#> 3     2     3
left_join(df1,df2)
#> Joining, by = "a"
#> # A tibble: 3 x 3
#>       a     b     c
#>   <dbl> <dbl> <dbl>
#> 1     1     2     1
#> 2     1     2     2
#> 3     2     2     3
left_join(df1,df2) %>% distinct(a, .keep_all = TRUE)
#> Joining, by = "a"
#> # A tibble: 2 x 3
#>       a     b     c
#>   <dbl> <dbl> <dbl>
#> 1     1     2     1
#> 2     2     2     3

Created on 2021-12-02 by the reprex package (v2.0.1)

  • Related