For each event, I am trying to calculate how many previous events happened in the previous winter (and the winter previous to that) within the same group (class and area).
Here is a simple dataset (my actual one has 85,000 records)
# creating a simple data.frame
class <- c(1,1,1,1,1,2,2,2,2)
area <- c("a", "a","b", "a", "a","b", "a", "a","b" )
event <- as.Date(c("2023-04-01", "2022-12-01", "2022-01-01",
"2021-12-01", "2022-12-01", "2022-12-01",
"2020-04-01", "2022-04-01", "2022-04-01"))
df <- data.frame(class, area, event)
str(df) # checking the structure of the data.frame
df <- df[order(class, area, event),] # sorting the order
df
df$events_in_previous_winter <- c(0,1,1,2,0,0,0,0,0) # this is the desired answer
df
I have tried using dplyr / group_by(class, area) and mutate to count, but I can't get it to work.
I define winter as the months of 12, 1, and 2 (Dec, Jan, Feb).
I would like to know for each group (unique pairing of class and area) how many "events" occurs in the previous winter to that event.
Any ideas?
CodePudding user response:
I would use a grouped summary to create a separate previous_winter_events
dataframe containing the number of events per winter. You can then get the number of events per previous winter using dplyr::lag()
. (You can also get events two winters ago by setting lag(x, n = 2)
, etc.) Then, merge these values back into your original dataframe using a left join.
This solution uses a winter_year
helper column to group each December with the following January and February even though they're in different years. I use tidyr::complete()
to add years with no events to previous_winter_events
.
library(dplyr)
library(tidyr)
library(lubridate)
event_df <- event_df %>%
mutate(winter_year = year(event %m-% months(2))) %>%
arrange(event_class, area, event)
previous_winter_events <- event_df %>%
complete(event_class, area, winter_year = full_seq(winter_year, 1)) %>%
group_by(event_class, area, winter_year) %>%
summarize(
events_this_winter = sum(month(event) %in% c(12, 1, 2)),
.groups = "drop_last"
) %>%
mutate(
events_in_previous_winter = dplyr::lag(events_this_winter, default = 0)
) %>%
ungroup()
event_df <- event_df %>%
left_join(previous_winter_events) %>%
select(!c(winter_year, events_this_winter)) # remove helper columns
event_df
Output:
event_class area event events_in_previous_winter
1 1 a 2018-01-01 0
2 1 a 2021-12-01 0
3 1 a 2022-01-31 0
4 1 a 2022-12-01 2
5 1 a 2022-12-01 2
6 1 a 2023-04-01 2
7 1 b 2022-01-01 0
8 2 a 2020-04-01 0
9 2 a 2022-04-01 0
10 2 b 2022-04-01 0
11 2 b 2022-12-01 0
Data:
# Added a couple additional test cases from OP comments.
# Changed names of `df` and `class` because those are function names in R.
event_df <- data.frame(
event_class = c(1,1,1,1,1,1,1,2,2,2,2),
area = c("a", "a","a", "a","b", "a", "a","b", "a", "a","b"),
event = as.Date(c("2018-01-01", "2023-04-01", "2022-12-01", "2022-01-31",
"2022-01-01", "2021-12-01", "2022-12-01", "2022-12-01",
"2020-04-01", "2022-04-01", "2022-04-01"))
)