Home > database >  R - Summarize values between specific date range
R - Summarize values between specific date range

Time:02-16

I have a dataset that spans a large range of dates and has associated values on each individual date.

I have a second dataset with start/end dates. I want to summarize the values in dataset 1, df in this example, based on the dates in dataset 2, date_df in this example.

The "goal_column" is what I want to be able to code. How can I accomplish this?

x <- c(8,7,4,6)
dates <- c("2021-01-02","2021-01-03","2021-01-04","2021-01-08")

df <- data.frame(x, dates)
df

start_dates <- c("2021-01-01","2021-01-07")
end_dates <- c("2021-01-06","2021-01-10")
goal_column <- c((8 7 4),6)

date_df <- data.frame(start_dates, end_dates, goal_column)
date_df

CodePudding user response:

Here's one way:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
x <- c(8,7,4,6)
dates <- lubridate::ymd(c("2021-01-02","2021-01-03","2021-01-04","2021-01-08"))

df <- data.frame(x, dates)
df
#>   x      dates
#> 1 8 2021-01-02
#> 2 7 2021-01-03
#> 3 4 2021-01-04
#> 4 6 2021-01-08

start_dates <- lubridate::ymd(c("2021-01-01","2021-01-07"))
end_dates <- lubridate::ymd(c("2021-01-06","2021-01-10"))
goal_column <- c((8 7 4),6)


goal_column = purrr::map(seq_along(start_dates), ~df %>% 
                           filter(dates >= start_dates[.x] & dates <= end_dates[.x]) %>% 
                           summarise(x=sum(x))) %>% 
  bind_rows() %>% 
  select(x) %>% 
  pull

goal_column
#> [1] 19  6

Created on 2022-02-15 by the reprex package (v2.0.1)

CodePudding user response:

A couple alternatives to consider. I assume your dates are actual dates and not character values.

You can try using fuzzyjoin to merge the two data.frames, including rows where the dates fall between start_dates and end_dates.

library(tidyverse)
library(fuzzyjoin)

fuzzy_left_join(
  date_df,
  df,
  by = c("start_dates" = "dates", "end_dates" = "dates"),
  match_fun = list(`<=`, `>=`)
) %>%
  group_by(start_dates, end_dates) %>%
  summarise(new_goal_column = sum(x))

Output

  start_dates end_dates  new_goal_column
  <date>      <date>               <dbl>
1 2021-01-01  2021-01-06              19
2 2021-01-07  2021-01-10               6

You can also try using data.table and joining.

library(data.table)

setDT(date_df)
setDT(df)

df[date_df, .(start_dates, end_dates, x), on = .(dates >= start_dates, dates <= end_dates)][
  , .(new_goal_column = sum(x)), by = .(start_dates, end_dates)
]

Output

   start_dates  end_dates new_goal_column
1:  2021-01-01 2021-01-06              19
2:  2021-01-07 2021-01-10               6
  •  Tags:  
  • r
  • Related