I have two data frames that look like below:
id1 <- c("1", "2", "3", "4", "4")
start <- c("2019-07-09", "2021-03-22", "2019-02-25", "2019-01-27", "2019-02-14")
end <- c("2019-07-10", "2021-03-27", "2019-02-27", "2019-01-28", "2019-02-16")
df1 <- data.frame(id1, start, end)
id2 <- c( "1","1", "2", "2" ,"2" ,"2" , "2", "2", "3","3", "3", "4", "4", "4" )
date <- c("2019-07-09","2019-07-10", "2021-03-22", "2021-03-23", "2021-03-24", "2021-03-25", "2021-03-26", "2021-03-27", "2019-02-25", "2019-02-26", "2019-02-27", "2019-01-27", "2019-02-14", "2019-02-15")
value <- c("0.5","0.5","0.5","0.5","0.5","0.5","0.5","0.5","0.5","0.5","0.5","0.5","0.5","0.5")
df2 <- data.frame(id2, date, value)
I am trying to match by id and sum the values in df2 if the dates fall with in the start and end date in df1.
The output should look like below. I've been trying to work this through dplyr and haven't had any luck.
###EXPECTED OUTPUT####
id1 <- c("1", "2", "3", "4", "4")
start <- c("2019-07-09", "2021-03-22", "2019-02-25", "2019-01-27", "2019-02-14")
end <- c("2019-07-10", "2021-03-27", "2019-02-27", "2019-01-28", "2019-02-16")
output_value <- c("1", "3", "1.5", "0.5", "1")
output <- data.frame(id1, start, end, output_value)
Thanks for your help!
CodePudding user response:
You'll need to first combine the two dataframe using left_join
, than make sure the date
, start
, and end
columns are type of "Date", and make sure the column value
is numeric. Then remove the date
that are not within start
and end
. Finally group_by
the relevant columns and sum
up the value.
library(tidyverse)
left_join(df2, df1, by = c("id2" = "id1")) %>%
mutate(across(c(date, start, end), as.Date),
value = as.numeric(value)) %>%
filter(date >= start & date <= end) %>%
group_by(id2, start, end) %>%
summarize(value = sum(value), .groups = "drop")
# A tibble: 5 x 4
id2 start end value
<chr> <date> <date> <dbl>
1 1 2019-07-09 2019-07-10 1
2 2 2021-03-22 2021-03-27 3
3 3 2019-02-25 2019-02-27 1.5
4 4 2019-01-27 2019-01-28 0.5
5 4 2019-02-14 2019-02-16 1
CodePudding user response:
The lubridate
package's %within%
operator and interval
function is useful for this kind of thing. You could do something like this:
library(dplyr)
library(lubridate)
df1 %>%
inner_join(df2, by = c("id1" = "id2")) %>%
mutate(across(c(date, start, end), as.Date),
value = as.numeric(value)) %>%
group_by(id1, start, end) %>%
summarise(output_value = sum(
ifelse(
date %within% interval(start, end),
value,
0
)
))
# A tibble: 5 x 4
# Groups: id1, start [5]
id1 start end output_value
<chr> <date> <date> <dbl>
1 1 2019-07-09 2019-07-10 1
2 2 2021-03-22 2021-03-27 3
3 3 2019-02-25 2019-02-27 1.5
4 4 2019-01-27 2019-01-28 0.5
5 4 2019-02-14 2019-02-16 1
CodePudding user response:
Another option could be to use the data.table
package. Below is a quick way to get your desired output. You do need to make sure to convert your data.frames to data.tables.
library(data.table)
df1 <- data.table(id1 = c("1", "2", "3", "4", "4"),
start = c("2019-07-09", "2021-03-22", "2019-02-25", "2019-01-27", "2019-02-14"),
end = c("2019-07-10", "2021-03-27", "2019-02-27", "2019-01-28", "2019-02-16"))
df2 <- data.table(id2 = c( "1","1", "2", "2" ,"2" ,"2" , "2", "2", "3","3", "3", "4", "4", "4" ),
date = c("2019-07-09","2019-07-10", "2021-03-22", "2021-03-23", "2021-03-24", "2021-03-25",
"2021-03-26", "2021-03-27", "2019-02-25", "2019-02-26", "2019-02-27", "2019-01-27", "2019-02-14", "2019-02-15"),
value = c("0.5","0.5","0.5","0.5","0.5","0.5","0.5","0.5","0.5","0.5","0.5","0.5","0.5","0.5"))
# convert value from chr to numeric
df2[,value := as.numeric(value)]
# merge tables by ids
new_dt = merge(df1, df2, by.x = "id1", by.y = "id2", all.y = TRUE)
# sum value when date is between start and end and group by id1,start,end
new_dt2 = new_dt[date