I have a large dataset containing daily values indicating whether that particular day in the year was especially hot or not (indicated by 1 or 0). I aim to identify sequences of 3 or more especially hot days and create a new dataset that contains the length and the start and end date of each.
I'm a bit stuck on how to go about this.
An example of my dataset:
hotday <- c(0,1,0,1,1,1,0,0,1,1,1,1,0)
dates <- seq.Date(from=as.Date("1990-06-01"), by="day",length.out = length(hotday))
df <- data.frame(dates,hotday)
df
dates hotday
1 1990-06-01 0
2 1990-06-02 1
3 1990-06-03 0
4 1990-06-04 1
5 1990-06-05 1
6 1990-06-06 1
7 1990-06-07 0
8 1990-06-08 0
9 1990-06-09 1
10 1990-06-10 1
11 1990-06-11 1
12 1990-06-12 1
13 1990-06-13 0
The output I would like to achieve should look as follows:
startdate enddate length
1 1990-06-04 1990-06-06 3
2 1990-06-09 1990-06-12 4
Thank you for the help, I am willing to take any approach or suggestion.
CodePudding user response:
If you prefer tidyverse syntax you could do
library(dplyr)
df %>%
mutate(run = cumsum(c(1, abs(diff(hotday))))) %>%
filter(hotday == 1) %>%
group_by(run) %>%
summarize(startdate = first(dates), enddate = last(dates), length = n()) %>%
ungroup() %>%
select(-run) %>%
filter(length >= 3)
#> # A tibble: 2 x 3
#> startdate enddate length
#> <date> <date> <int>
#> 1 1990-06-04 1990-06-06 3
#> 2 1990-06-09 1990-06-12 4
Created on 2022-09-30 with reprex v2.0.2
CodePudding user response:
An option is using rle
in base R or rleid
from data.table
- grouped by the run-length-id of the 'hotday' summarise by taking the first
and last
of the 'dates' where the number of rows is greater than 2 and 1 is present in hotday
library(data.table)
setDT(df)[, if(.N >2 && 1 %in% hotday) .(startdate = first(dates),
enddate = last(dates), length = .N) ,.(grp = rleid(hotday))][,
grp := NULL][]
-output
startdate enddate length
<Date> <Date> <int>
1: 1990-06-04 1990-06-06 3
2: 1990-06-09 1990-06-12 4
CodePudding user response:
1) Use groupid
from collapse to assign a unique number to each run and then for each such run use filter
to keep it only if it represents a run of hotday
values and has a run length of at least 3. Then form the desired summary data and at the end remove the groupid
.
library(dplyr)
library(collapse)
df %>%
group_by(groupid = groupid(hotday)) %>%
filter(first(hotday) == 1 && n() >= 3) %>%
summarize(start = first(dates), end = last(dates), length = n()) %>%
ungroup %>%
select(-groupid)
## # A tibble: 2 × 3
## start end length
## <date> <date> <int>
## 1 1990-06-04 1990-06-06 3
## 2 1990-06-09 1990-06-12 4
2) A base R approach that follows the same approach as (1) is shown below. g corresponds to groupid
there.
stats <- function(x) data.frame(start = x[1], end = tail(x, 1), length = length(x))
df |>
transform(g = with(rle(hotday), rep(seq_along(lengths), lengths))) |>
subset(hotday == 1 & ave(g, g, FUN = length) >= 3) |>
with(by(dates, g, stats)) |>
do.call(what = "rbind")
## start end length
## 4 1990-06-04 1990-06-06 3
## 6 1990-06-09 1990-06-12 4
3) A completely different approach using only base R is to convert hotday
to a character string and use gregexpr
to find the starts and lengths of the 1 runs.
g <- gregexpr("1{3,}", paste(df$hotday, collapse = ""))[[1]]
len <- attr(g, "match.length")
with(df, data.frame(start = dates[g], end = dates[g len-1], lengths = len))
## start end lengths
## 1 1990-06-04 1990-06-06 3
## 2 1990-06-09 1990-06-12 4
4) Another approach is based on the idea that the number of 0's prior to each run of 1's is the same for each element of a particular run of 1's. This approach uses only base R.
df |>
transform(g = cumsum(hotday == 0)) |>
subset(hotday == 1) |>
with(data.frame(start = dates[match(unique(g), g)],
end = dates[findInterval(unique(g), g)],
len = c(table(g)))) |>
subset(len >= 3)
## start end len
## 2 1990-06-04 1990-06-06 3
## 4 1990-06-09 1990-06-12 4
4a) The same idea could be used in (1) to eliminate the need for the groupid function.
df %>%
group_by(groupid = cumsum(hotday == 0)) %>%
filter(hotday == 1) %>%
filter(n() >= 3) %>%
summarize(start = first(dates), end = last(dates), length = n()) %>%
ungroup %>%
select(-groupid)