Home > Mobile >  Sum values if date falls within range (two dataframes)
Sum values if date falls within range (two dataframes)

Time:02-17

I have two data frames that look like below:

id1 <- c("1", "2", "3", "4", "4")
start <- c("2019-07-09", "2021-03-22", "2019-02-25", "2019-01-27", "2019-02-14")
end <- c("2019-07-10", "2021-03-27", "2019-02-27", "2019-01-28", "2019-02-16")

df1 <- data.frame(id1, start, end)
id2 <- c( "1","1", "2", "2" ,"2" ,"2" , "2", "2", "3","3", "3", "4", "4", "4" )
date <- c("2019-07-09","2019-07-10", "2021-03-22", "2021-03-23", "2021-03-24", "2021-03-25", "2021-03-26", "2021-03-27", "2019-02-25", "2019-02-26", "2019-02-27", "2019-01-27", "2019-02-14", "2019-02-15")
value <- c("0.5","0.5","0.5","0.5","0.5","0.5","0.5","0.5","0.5","0.5","0.5","0.5","0.5","0.5")

df2 <- data.frame(id2, date, value)

I am trying to match by id and sum the values in df2 if the dates fall with in the start and end date in df1.

The output should look like below. I've been trying to work this through dplyr and haven't had any luck.

###EXPECTED OUTPUT####

id1 <- c("1", "2", "3", "4", "4")
start <- c("2019-07-09", "2021-03-22", "2019-02-25", "2019-01-27", "2019-02-14")
end <- c("2019-07-10", "2021-03-27", "2019-02-27", "2019-01-28", "2019-02-16")
output_value <- c("1", "3", "1.5", "0.5", "1")

output <- data.frame(id1, start, end, output_value)

Thanks for your help!

CodePudding user response:

You'll need to first combine the two dataframe using left_join, than make sure the date, start, and end columns are type of "Date", and make sure the column value is numeric. Then remove the date that are not within start and end. Finally group_by the relevant columns and sum up the value.

library(tidyverse)

left_join(df2, df1, by = c("id2" = "id1")) %>% 
  mutate(across(c(date, start, end), as.Date),
         value = as.numeric(value)) %>% 
  filter(date >= start & date <= end) %>% 
  group_by(id2, start, end) %>% 
  summarize(value = sum(value), .groups = "drop")


# A tibble: 5 x 4
  id2   start      end        value
  <chr> <date>     <date>     <dbl>
1 1     2019-07-09 2019-07-10   1  
2 2     2021-03-22 2021-03-27   3  
3 3     2019-02-25 2019-02-27   1.5
4 4     2019-01-27 2019-01-28   0.5
5 4     2019-02-14 2019-02-16   1  

CodePudding user response:

The lubridate package's %within% operator and interval function is useful for this kind of thing. You could do something like this:

library(dplyr)
library(lubridate)

df1 %>% 
  inner_join(df2, by = c("id1" = "id2")) %>% 
  mutate(across(c(date, start, end), as.Date),
         value = as.numeric(value)) %>% 
  group_by(id1, start, end) %>% 
  summarise(output_value =  sum(
    ifelse(
      date %within% interval(start, end),
      value,
      0
    )
  ))

# A tibble: 5 x 4
# Groups:   id1, start [5]
  id1   start      end             output_value
  <chr> <date>     <date>                 <dbl>
1 1     2019-07-09 2019-07-10               1  
2 2     2021-03-22 2021-03-27               3  
3 3     2019-02-25 2019-02-27               1.5
4 4     2019-01-27 2019-01-28               0.5
5 4     2019-02-14 2019-02-16               1  

CodePudding user response:

Another option could be to use the data.table package. Below is a quick way to get your desired output. You do need to make sure to convert your data.frames to data.tables.

library(data.table)
df1 <- data.table(id1 = c("1", "2", "3", "4", "4"), 
                  start = c("2019-07-09", "2021-03-22", "2019-02-25", "2019-01-27", "2019-02-14"), 
                  end = c("2019-07-10", "2021-03-27", "2019-02-27", "2019-01-28", "2019-02-16"))

df2 <- data.table(id2 = c( "1","1", "2", "2" ,"2" ,"2" , "2", "2", "3","3", "3", "4", "4", "4" ), 
                  date = c("2019-07-09","2019-07-10", "2021-03-22", "2021-03-23", "2021-03-24", "2021-03-25", 
                           "2021-03-26", "2021-03-27", "2019-02-25", "2019-02-26", "2019-02-27", "2019-01-27", "2019-02-14", "2019-02-15"), 
                  value = c("0.5","0.5","0.5","0.5","0.5","0.5","0.5","0.5","0.5","0.5","0.5","0.5","0.5","0.5"))

# convert value from chr to numeric
df2[,value := as.numeric(value)]

# merge tables by ids
new_dt = merge(df1, df2, by.x = "id1", by.y = "id2", all.y = TRUE)

# sum value when date is between start and end and group by id1,start,end
new_dt2 = new_dt[date            
  • Related