I have a data set with a list of event dates and a list of sample dates. Events and samples are grouped by unit. For each sample date, I want to count the number of events that came before that sample date and the number of different months in which those events occurred, grouped by unit. A couple complications: sometimes the event date happens after the sample date in the same year. Sometimes there are sample dates but no event in a particular year.
Example data (my actual dataset has ~6000 observations):
data<-read.table(header=T, text="
unit eventdate eventmonth sampledate year
a 1996-06-01 06 1996-08-01 1996
a 1997-09-03 09 1997-08-02 1997
a 1998-05-15 05 1998-08-03 1998
a NA NA 1999-08-02 1999
b 1996-05-31 05 1996-08-01 1996
b 1997-05-31 05 1997-08-02 1997
b 1998-05-15 05 1998-08-03 1998
b 1999-05-16 05 1999-08-02 1999")
Output data should look something like this:
year unit numevent nummonth
1996 a 1 1
1997 a 1 1
1998 a 3 3
1999 a 3 3
1996 b 1 1
1997 b 2 1
1998 b 3 2
1999 b 4 1
Note that in 1997 in unit a, the event is not counted because it happened after the sample date.
For smaller datasets, I have manually subset the data by each sample date and counted events/unique months (and then merged the datasets back together), but I can't do that with ~6000 observations.
numevent.1996<-ddply(data[data$eventdate<'1996-08-01',], .(unit),
summarize, numevent=length(eventdate), nummth=length(unique(eventmonth)), year=1996)
CodePudding user response:
This might work:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
data<-read.table(header=T, text="
unit eventdate eventmonth sampledate year
a 1996-06-01 06 1996-08-01 1996
a 1997-09-03 09 1997-08-02 1997
a 1998-05-15 05 1998-08-03 1998
a NA NA 1999-08-02 1999
b 1996-05-31 05 1996-08-01 1996
b 1997-05-31 05 1997-08-02 1997
b 1998-05-15 05 1998-08-03 1998
b 1999-05-16 05 1999-08-02 1999")
data <- data %>%
mutate(eventdate = lubridate::ymd(eventdate),
sampledate = lubridate::ymd(sampledate))
data %>%
group_by(unit, year, eventmonth) %>%
summarise(numevent = sum(sampledate >= eventdate)) %>%
group_by(unit, year) %>%
summarise(nummonth = sum(numevent > 0),
numevent = sum(numevent))
#> `summarise()` has grouped output by 'unit', 'year'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'unit'. You can override using the
#> `.groups` argument.
#> # A tibble: 8 × 4
#> # Groups: unit [2]
#> unit year nummonth numevent
#> <chr> <int> <int> <int>
#> 1 a 1996 1 1
#> 2 a 1997 0 0
#> 3 a 1998 1 1
#> 4 a 1999 NA NA
#> 5 b 1996 1 1
#> 6 b 1997 1 1
#> 7 b 1998 1 1
#> 8 b 1999 1 1
Created on 2023-01-08 by the reprex package (v2.0.1)
Note, I don't think the data you've included actually produce the output you proposed as the output looks to have 18 events that meet the condition and there are only 8 rows in the sample data provided.
CodePudding user response:
Try this?
data %>%
group_by(unit) %>%
mutate(
numevent = sapply(sampledate, function(z) sum(eventdate < z, na.rm = TRUE)),
nummonth = sapply(sampledate, function(z) length(unique(na.omit(eventmonth[eventdate < z]))))
) %>%
ungroup()
# # A tibble: 8 × 7
# unit eventdate eventmonth sampledate year numevent nummonth
# <chr> <date> <int> <date> <int> <int> <int>
# 1 a 1996-06-01 6 1996-08-01 1996 1 1
# 2 a 1997-09-03 9 1997-08-02 1997 1 1
# 3 a 1998-05-15 5 1998-08-03 1998 3 3
# 4 a NA NA 1999-08-02 1999 3 3
# 5 b 1996-05-31 5 1996-08-01 1996 1 1
# 6 b 1997-05-31 5 1997-08-02 1997 2 1
# 7 b 1998-05-15 5 1998-08-03 1998 3 1
# 8 b 1999-05-16 5 1999-08-02 1999 4 1
Data
data <- structure(list(unit = c("a", "a", "a", "a", "b", "b", "b", "b"), eventdate = structure(c(9648, 10107, 10361, NA, 9647, 10012, 10361, 10727), class = "Date"), eventmonth = c(6L, 9L, 5L, NA, 5L, 5L, 5L, 5L), sampledate = structure(c(9709, 10075, 10441, 10805, 9709, 10075, 10441, 10805), class = "Date"), year = c(1996L, 1997L, 1998L, 1999L, 1996L, 1997L, 1998L, 1999L)), class = "data.frame", row.names = c(NA, -8L))