Home > Software engineering >  Finding number of events in previous winter within group
Finding number of events in previous winter within group

Time:10-27

For each event, I am trying to calculate how many previous events happened in the previous winter (and the winter previous to that) within the same group (class and area).

Here is a simple dataset (my actual one has 85,000 records)

# creating a simple data.frame

class <- c(1,1,1,1,1,2,2,2,2)
area <-  c("a", "a","b", "a", "a","b",  "a", "a","b" )
event <- as.Date(c("2023-04-01", "2022-12-01", "2022-01-01",
                   "2021-12-01", "2022-12-01", "2022-12-01",
                   "2020-04-01", "2022-04-01", "2022-04-01"))

df <- data.frame(class, area, event)



str(df) # checking the structure of the data.frame

df <- df[order(class, area, event),] # sorting the order

df

df$events_in_previous_winter <- c(0,1,1,2,0,0,0,0,0)  # this is the desired answer

df

I have tried using dplyr / group_by(class, area) and mutate to count, but I can't get it to work.

I define winter as the months of 12, 1, and 2 (Dec, Jan, Feb).

I would like to know for each group (unique pairing of class and area) how many "events" occurs in the previous winter to that event.

Any ideas?

CodePudding user response:

I would use a grouped summary to create a separate previous_winter_events dataframe containing the number of events per winter. You can then get the number of events per previous winter using dplyr::lag(). (You can also get events two winters ago by setting lag(x, n = 2), etc.) Then, merge these values back into your original dataframe using a left join.

This solution uses a winter_year helper column to group each December with the following January and February even though they're in different years. I use tidyr::complete() to add years with no events to previous_winter_events.

library(dplyr)
library(tidyr)
library(lubridate)

event_df <- event_df %>% 
  mutate(winter_year = year(event %m-% months(2))) %>% 
  arrange(event_class, area, event)

previous_winter_events <- event_df %>% 
  complete(event_class, area, winter_year = full_seq(winter_year, 1)) %>% 
  group_by(event_class, area, winter_year) %>% 
  summarize(
    events_this_winter = sum(month(event) %in% c(12, 1, 2)),
    .groups = "drop_last"
  ) %>% 
  mutate(
    events_in_previous_winter = dplyr::lag(events_this_winter, default = 0)
  ) %>% 
  ungroup()

event_df <- event_df %>% 
  left_join(previous_winter_events) %>% 
  select(!c(winter_year, events_this_winter)) # remove helper columns

event_df

Output:

   event_class area      event events_in_previous_winter
1            1    a 2018-01-01                         0
2            1    a 2021-12-01                         0
3            1    a 2022-01-31                         0
4            1    a 2022-12-01                         2
5            1    a 2022-12-01                         2
6            1    a 2023-04-01                         2
7            1    b 2022-01-01                         0
8            2    a 2020-04-01                         0
9            2    a 2022-04-01                         0
10           2    b 2022-04-01                         0
11           2    b 2022-12-01                         0

Data:

# Added a couple additional test cases from OP comments.
# Changed names of `df` and `class` because those are function names in R.
event_df <- data.frame(
  event_class = c(1,1,1,1,1,1,1,2,2,2,2),
  area = c("a", "a","a", "a","b", "a", "a","b", "a", "a","b"),
  event = as.Date(c("2018-01-01", "2023-04-01", "2022-12-01", "2022-01-31", 
                    "2022-01-01", "2021-12-01", "2022-12-01", "2022-12-01",
                    "2020-04-01", "2022-04-01", "2022-04-01"))
)
  • Related