Home > front end >  left_join adding additional rows R
left_join adding additional rows R

Time:12-03

I have two tables. The first table has 206206 rows and the second one 300. After using left_join this way:

final <- left_join(first, second, by="CATEGORY")

I get that final has 223201 rows, but I don't understand why additional rows were added. I need that at the end still, 206206 should be, only additional columns added. What can be a problem? Tried inner_join instead, still a similar problem.

CodePudding user response:

Joining always adds additional rows if there are multiple ways for matching. Here, c can be both 1 and 2 if a is 1. Use distinct on the id columns (here: a) to just get one of that one-to-many match:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df1 <- tibble(a = c(1,2), b = 2)
df1
#> # A tibble: 2 x 2
#>       a     b
#>   <dbl> <dbl>
#> 1     1     2
#> 2     2     2
df2 <- tibble(a = c(1,1,2), c = c(1,2,3))
df2
#> # A tibble: 3 x 2
#>       a     c
#>   <dbl> <dbl>
#> 1     1     1
#> 2     1     2
#> 3     2     3
left_join(df1,df2)
#> Joining, by = "a"
#> # A tibble: 3 x 3
#>       a     b     c
#>   <dbl> <dbl> <dbl>
#> 1     1     2     1
#> 2     1     2     2
#> 3     2     2     3
left_join(df1,df2) %>% distinct(a, .keep_all = TRUE)
#> Joining, by = "a"
#> # A tibble: 2 x 3
#>       a     b     c
#>   <dbl> <dbl> <dbl>
#> 1     1     2     1
#> 2     2     2     3

Created on 2021-12-02 by the reprex package (v2.0.1)

  • Related