I am joining two data frames in R and I only want a value from one of them to show up once. This scenario is similar to the code below. Suppose I have these customers given by the id in the code below. In the data frame, t, I have the customers id and the items they bought. In the data frame, t2, I have the total amount they spent. When I go to use a join, the total amount spent column gets used twice and throws off the numbers that I really want. Does anyone know how to fix this issue that can be used for large data frames?
t <- data.frame(id=c(1,1,2,3),product=c("a","b","b","c"))
t2 <- data.frame(id=c(1,2,3),total_spent=c(12,23,24))
CodePudding user response:
So, this output is not what you wanted?
library(tidyverse)
t <- data.frame(id=c(1,1,2,3),product=c("a","b","b","c"))
t2 <- data.frame(id=c(1,2,3),total_spent=c(12,23,24))
left_join(t, t2) %>%
as_tibble()
#> Joining, by = "id"
#> # A tibble: 4 x 3
#> id product total_spent
#> <dbl> <chr> <dbl>
#> 1 1 a 12
#> 2 1 b 12
#> 3 2 b 23
#> 4 3 c 24
Created on 2022-06-29 by the reprex package (v2.0.1)
CodePudding user response:
Expanding on answer from @tom-hoel. You can use left join and then adjust total_spent
to NA if id is not unique.
library(dplyr)
t <- data.frame(id=c(1,1,2,3),product=c("a","b","b","c"))
t2 <- data.frame(id=c(1,2,3),total_spent=c(12,23,24))
t %>%
left_join(t2, by = "id") %>%
mutate(total_spent = ifelse(duplicated(id), NA, total_spent))
output
id product total_spent
1 1 a 12
2 1 b NA
3 2 b 23
4 3 c 24
see ?duplicated
and ?ifelse