How to merge by two columns, aggregating one of them

I'm struggling with how to merge two data frames using two columns. One data frame contains measurements of how much palette was left on certain dates. The other contains the distance travelled by each car. I need to merge them so that each palette measurement is matched, by car, with the sum of that car's distances up to the date of the measurement. Here is a toy example:

#palette measure dataframe
measure = data.frame(car = c("A", "A", "A", "B"), data1 = c("20-09-2020", "15-10-2020", "13-05-2021", "20-10-2021"), palette = c(5,4,3,5))
#> measure
#  car      data1 palette 
#1   A 20-09-2020   5
#2   A 15-10-2020   4
#3   A 13-05-2021   3
#4   B 20-10-2021   5

#the distance dataframe
dist_ = data.frame(car = c("A", "C", "B", "A", "A", "A"), data2 = c("20-09-2020", "14-05-2020", "20-10-2021", "10-01-2021", "11-01-2021", "13-01-2021"), distance = c(10, 20, 10, 5, 3,8))
#> dist_
# car      data2 distance
#1   A 20-09-2020       10
#2   C 14-05-2020       20
#3   B 20-10-2021       10
#4   A 10-01-2021        5
#5   A 11-01-2021        3
#6   A 13-01-2021        8

#for result I'd like something like
#  car      data1 palette distance
#1   A 20-09-2020       5       10
#2   A 15-10-2020       4        0
#3   A 13-05-2021       3       16
#4   B 20-10-2021       5       10

Note that the distances are summed up to each date on which the palette is measured. So I can say, for example, that a car had covered 16 km when its palette measured 3 cm.

I thought I could use something like merge(x = measure, y = dist_, by.x = c("car", "data1"), by.y = c("car", "data2"), all.x = TRUE), but I don't know how to sum the distance values up to the date of each palette measurement for a specific car.

Any hints on how I could do that?

CodePudding user response：

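Something like this would work. It pairs every measurement with every distance record for the same car, keeps the distance records dated on or before the measurement, assigns each one to the earliest such measurement, and then sums per measurement: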

library(tidyverse)
library(lubridate)

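# Pair every measurement with every distance record of the same car
# (dplyr >= 1.1.0 may warn about a many-to-many join here; that is expected)
result <- left_join(measure, dist_, by = c("car")) %>% 
  mutate(across(c("data1", "data2"), dmy)) %>%   # parse dd-mm-yyyy strings as dates
  filter(data1 >= data2) %>%                     # keep distances on or before the measurement
  group_by(car, data2) %>% 
  mutate(threshold = min(data1)) %>%             # earliest measurement covering each distance
  ungroup() %>% 
  filter(data1 == threshold) %>%                 # assign each distance to that measurement only
  group_by(car, data1, palette) %>% 
  summarise(distance = sum(distance))            # total distance per measurement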

result
# A tibble: 3 x 4
# Groups:   car, data1 [3]
  car   data1      palette distance
  <chr> <date>       <dbl>    <dbl>
1 A     2020-09-20       5       10
2 A     2021-05-13       3       16
3 B     2021-10-20       5       10

If you want to keep the non-matches (their distance will be NA), you could then rejoin with measure like so:

result.final <- measure %>% 
  mutate(data1 = dmy(data1))%>% 
  left_join(result, by = c("car", "data1", "palette")) 

result.final
  car      data1 palette distance
1   A 2020-09-20       5       10
2   A 2020-10-15       4       NA
3   A 2021-05-13       3       16
4   B 2021-10-20       5       10
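
The expected output in the question shows 0 rather than NA for the measurement that has no distance in its window. As an alternative to the dplyr pipeline, here is a minimal base R sketch that yields 0 directly, because sum() over an empty selection is 0. It assumes the same rule as above: a distance record belongs to the first measurement of the same car dated on or after it. The name prev_date is a hypothetical helper introduced only for this illustration.

```r
# Toy data from the question
measure <- data.frame(
  car = c("A", "A", "A", "B"),
  data1 = c("20-09-2020", "15-10-2020", "13-05-2021", "20-10-2021"),
  palette = c(5, 4, 3, 5)
)
dist_ <- data.frame(
  car = c("A", "C", "B", "A", "A", "A"),
  data2 = c("20-09-2020", "14-05-2020", "20-10-2021",
            "10-01-2021", "11-01-2021", "13-01-2021"),
  distance = c(10, 20, 10, 5, 3, 8)
)

# Parse the dd-mm-yyyy strings as dates
measure$data1 <- as.Date(measure$data1, format = "%d-%m-%Y")
dist_$data2 <- as.Date(dist_$data2, format = "%d-%m-%Y")

# Put each car's measurements in date order
measure <- measure[order(measure$car, measure$data1), ]

# Date of the previous measurement of the same car (NA for the first one).
# prev_date is a hypothetical helper, not part of the original answer.
prev_date <- ave(as.numeric(measure$data1), measure$car,
                 FUN = function(d) c(NA, head(d, -1)))
prev_date <- as.Date(prev_date, origin = "1970-01-01")

# For each measurement, sum the distances of the same car that fall after
# the previous measurement and on or before the current one
measure$distance <- vapply(seq_len(nrow(measure)), function(i) {
  same_car <- dist_$car == measure$car[i]
  not_after <- dist_$data2 <= measure$data1[i]
  after_prev <- is.na(prev_date[i]) | dist_$data2 > prev_date[i]
  sum(dist_$distance[same_car & not_after & after_prev])
}, numeric(1))

measure
```

This loops over the measurements, so it is meant for small to moderate data; for large tables the join-based approach above is likely the better fit. If you prefer to stay with the dplyr result, replacing the NA values in result.final$distance with 0 gives the same table.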