Home > database >  Making joins in R so that observations are only recorded once
Making joins in R so that observations are only recorded once

Time:06-30

I am joining two data frames in R and I only want a value from one of them to show up once. This scenario is similar to the code below. Suppose I have these customers given by the id in the code below. In the data frame, t, I have the customers id and the items they bought. In the data frame, t2, I have the total amount they spent. When I go to use a join, the total amount spent column gets used twice and throws off the numbers that I really want. Does anyone know how to fix this issue that can be used for large data frames?

t <- data.frame(id=c(1,1,2,3),product=c("a","b","b","c"))
t2 <- data.frame(id=c(1,2,3),total_spent=c(12,23,24))

CodePudding user response:

So, this output is not what you wanted?

library(tidyverse)

t <- data.frame(id=c(1,1,2,3),product=c("a","b","b","c"))
t2 <- data.frame(id=c(1,2,3),total_spent=c(12,23,24))

left_join(t, t2) %>%  
  as_tibble()

#> Joining, by = "id"
#> # A tibble: 4 x 3
#>      id product total_spent
#>   <dbl> <chr>         <dbl>
#> 1     1 a                12
#> 2     1 b                12
#> 3     2 b                23
#> 4     3 c                24

Created on 2022-06-29 by the reprex package (v2.0.1)

CodePudding user response:

Expanding on answer from @tom-hoel. You can use left join and then adjust total_spent to NA if id is not unique.

library(dplyr)
t <- data.frame(id=c(1,1,2,3),product=c("a","b","b","c"))
t2 <- data.frame(id=c(1,2,3),total_spent=c(12,23,24))

t %>%
  left_join(t2, by = "id") %>%
  mutate(total_spent = ifelse(duplicated(id), NA, total_spent))

output

  id product total_spent
1  1       a          12
2  1       b          NA
3  2       b          23
4  3       c          24

see ?duplicated and ?ifelse

  •  Tags:  
  • r
  • Related