Repeatedly count events before a certain date in R-CodePudding

I have a data set with a list of event dates and a list of sample dates. Events and samples are grouped by unit. For each sample date, I want to count the number of events that came before that sample date and the number of different months in which those events occurred, grouped by unit. A couple complications: sometimes the event date happens after the sample date in the same year. Sometimes there are sample dates but no event in a particular year.

Example data (my actual dataset has ~6000 observations):

data<-read.table(header=T, text="
 unit  eventdate eventmonth  sampledate  year
 a 1996-06-01  06  1996-08-01  1996
 a 1997-09-03  09  1997-08-02  1997
 a 1998-05-15  05  1998-08-03  1998
 a NA  NA  1999-08-02  1999
 b 1996-05-31  05  1996-08-01  1996
 b 1997-05-31  05  1997-08-02  1997
 b 1998-05-15  05  1998-08-03  1998
 b 1999-05-16  05  1999-08-02  1999")

Output data should look something like this:

year unit numevent nummonth
1996 a 1 1
1997 a 1 1
1998 a 3 3
1999 a 3 3
1996 b 1 1
1997 b 2 1
1998 b 3 2
1999 b 4 1

Note that in 1997 in unit a, the event is not counted because it happened after the sample date.

For smaller datasets, I have manually subset the data by each sample date and counted events/unique months (and then merged the datasets back together), but I can't do that with ~6000 observations.

numevent.1996<-ddply(data[data$eventdate<'1996-08-01',], .(unit), 
                     summarize, numevent=length(eventdate), nummth=length(unique(eventmonth)), year=1996)

CodePudding user response：

This might work:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
data<-read.table(header=T, text="
 unit  eventdate eventmonth  sampledate  year
 a 1996-06-01  06  1996-08-01  1996
 a 1997-09-03  09  1997-08-02  1997
 a 1998-05-15  05  1998-08-03  1998
 a NA  NA  1999-08-02  1999
 b 1996-05-31  05  1996-08-01  1996
 b 1997-05-31  05  1997-08-02  1997
 b 1998-05-15  05  1998-08-03  1998
 b 1999-05-16  05  1999-08-02  1999")

data <- data %>% 
  mutate(eventdate = lubridate::ymd(eventdate), 
         sampledate = lubridate::ymd(sampledate))

data %>% 
  group_by(unit, year, eventmonth) %>% 
  summarise(numevent = sum(sampledate >= eventdate)) %>% 
  group_by(unit, year) %>% 
  summarise(nummonth = sum(numevent > 0), 
            numevent = sum(numevent))
#> `summarise()` has grouped output by 'unit', 'year'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'unit'. You can override using the
#> `.groups` argument.
#> # A tibble: 8 × 4
#> # Groups:   unit [2]
#>   unit   year nummonth numevent
#>   <chr> <int>    <int>    <int>
#> 1 a      1996        1        1
#> 2 a      1997        0        0
#> 3 a      1998        1        1
#> 4 a      1999       NA       NA
#> 5 b      1996        1        1
#> 6 b      1997        1        1
#> 7 b      1998        1        1
#> 8 b      1999        1        1

^{Created on 2023-01-08 by the reprex package (v2.0.1)}

Note, I don't think the data you've included actually produce the output you proposed as the output looks to have 18 events that meet the condition and there are only 8 rows in the sample data provided.

CodePudding user response：

Try this?

data %>%
  group_by(unit) %>%
  mutate(
    numevent = sapply(sampledate, function(z) sum(eventdate < z, na.rm = TRUE)), 
    nummonth = sapply(sampledate, function(z) length(unique(na.omit(eventmonth[eventdate < z]))))
  ) %>%
  ungroup()
# # A tibble: 8 × 7
#   unit  eventdate  eventmonth sampledate  year numevent nummonth
#   <chr> <date>          <int> <date>     <int>    <int>    <int>
# 1 a     1996-06-01          6 1996-08-01  1996        1        1
# 2 a     1997-09-03          9 1997-08-02  1997        1        1
# 3 a     1998-05-15          5 1998-08-03  1998        3        3
# 4 a     NA                 NA 1999-08-02  1999        3        3
# 5 b     1996-05-31          5 1996-08-01  1996        1        1
# 6 b     1997-05-31          5 1997-08-02  1997        2        1
# 7 b     1998-05-15          5 1998-08-03  1998        3        1
# 8 b     1999-05-16          5 1999-08-02  1999        4        1

Data

data <- structure(list(unit = c("a", "a", "a", "a", "b", "b", "b", "b"), eventdate = structure(c(9648, 10107, 10361, NA, 9647, 10012, 10361, 10727), class = "Date"), eventmonth = c(6L, 9L, 5L, NA, 5L, 5L, 5L, 5L), sampledate = structure(c(9709, 10075, 10441, 10805, 9709, 10075, 10441, 10805), class = "Date"), year = c(1996L, 1997L, 1998L, 1999L, 1996L, 1997L, 1998L, 1999L)), class = "data.frame", row.names = c(NA, -8L))