I have a dataset that spans a large range of dates and has associated values on each individual date.
I have a second dataset with start/end dates. I want to summarize the values in dataset 1, df in this example, based on the dates in dataset 2, date_df in this example.
The "goal_column" is what I want to be able to code. How can I accomplish this?
x <- c(8,7,4,6)
dates <- c("2021-01-02","2021-01-03","2021-01-04","2021-01-08")
df <- data.frame(x, dates)
df
start_dates <- c("2021-01-01","2021-01-07")
end_dates <- c("2021-01-06","2021-01-10")
goal_column <- c((8 7 4),6)
date_df <- data.frame(start_dates, end_dates, goal_column)
date_df
CodePudding user response:
Here's one way:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
x <- c(8,7,4,6)
dates <- lubridate::ymd(c("2021-01-02","2021-01-03","2021-01-04","2021-01-08"))
df <- data.frame(x, dates)
df
#> x dates
#> 1 8 2021-01-02
#> 2 7 2021-01-03
#> 3 4 2021-01-04
#> 4 6 2021-01-08
start_dates <- lubridate::ymd(c("2021-01-01","2021-01-07"))
end_dates <- lubridate::ymd(c("2021-01-06","2021-01-10"))
goal_column <- c((8 7 4),6)
goal_column = purrr::map(seq_along(start_dates), ~df %>%
filter(dates >= start_dates[.x] & dates <= end_dates[.x]) %>%
summarise(x=sum(x))) %>%
bind_rows() %>%
select(x) %>%
pull
goal_column
#> [1] 19 6
Created on 2022-02-15 by the reprex package (v2.0.1)
CodePudding user response:
A couple alternatives to consider. I assume your dates are actual dates and not character values.
You can try using fuzzyjoin
to merge the two data.frames, including rows where the dates
fall between start_dates
and end_dates
.
library(tidyverse)
library(fuzzyjoin)
fuzzy_left_join(
date_df,
df,
by = c("start_dates" = "dates", "end_dates" = "dates"),
match_fun = list(`<=`, `>=`)
) %>%
group_by(start_dates, end_dates) %>%
summarise(new_goal_column = sum(x))
Output
start_dates end_dates new_goal_column
<date> <date> <dbl>
1 2021-01-01 2021-01-06 19
2 2021-01-07 2021-01-10 6
You can also try using data.table
and joining.
library(data.table)
setDT(date_df)
setDT(df)
df[date_df, .(start_dates, end_dates, x), on = .(dates >= start_dates, dates <= end_dates)][
, .(new_goal_column = sum(x)), by = .(start_dates, end_dates)
]
Output
start_dates end_dates new_goal_column
1: 2021-01-01 2021-01-06 19
2: 2021-01-07 2021-01-10 6