I have two tables. The first table has 206206 rows and the second one 300. After using left_join this way:
final <- left_join(first, second, by="CATEGORY")
I get that final has 223201 rows, but I don't understand why additional rows were added. I need that at the end still, 206206 should be, only additional columns added. What can be a problem? Tried inner_join instead, still a similar problem.
CodePudding user response:
Joining always adds additional rows if there are multiple ways for matching.
Here, c
can be both 1 and 2 if a
is 1.
Use distinct
on the id columns (here: a
) to just get one of that one-to-many match:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df1 <- tibble(a = c(1,2), b = 2)
df1
#> # A tibble: 2 x 2
#> a b
#> <dbl> <dbl>
#> 1 1 2
#> 2 2 2
df2 <- tibble(a = c(1,1,2), c = c(1,2,3))
df2
#> # A tibble: 3 x 2
#> a c
#> <dbl> <dbl>
#> 1 1 1
#> 2 1 2
#> 3 2 3
left_join(df1,df2)
#> Joining, by = "a"
#> # A tibble: 3 x 3
#> a b c
#> <dbl> <dbl> <dbl>
#> 1 1 2 1
#> 2 1 2 2
#> 3 2 2 3
left_join(df1,df2) %>% distinct(a, .keep_all = TRUE)
#> Joining, by = "a"
#> # A tibble: 2 x 3
#> a b c
#> <dbl> <dbl> <dbl>
#> 1 1 2 1
#> 2 2 2 3
Created on 2021-12-02 by the reprex package (v2.0.1)