Identifying/describing sequences of consecutive days with certain value within a vector-CodePudding

I have a large dataset containing daily values indicating whether that particular day in the year was especially hot or not (indicated by 1 or 0). I aim to identify sequences of 3 or more especially hot days and create a new dataset that contains the length and the start and end date of each.

I'm a bit stuck on how to go about this.

An example of my dataset:

hotday <- c(0,1,0,1,1,1,0,0,1,1,1,1,0)
dates <- seq.Date(from=as.Date("1990-06-01"), by="day",length.out = length(hotday))
df <- data.frame(dates,hotday)
df
        dates hotday
1  1990-06-01      0
2  1990-06-02      1
3  1990-06-03      0
4  1990-06-04      1
5  1990-06-05      1
6  1990-06-06      1
7  1990-06-07      0
8  1990-06-08      0
9  1990-06-09      1
10 1990-06-10      1
11 1990-06-11      1
12 1990-06-12      1
13 1990-06-13      0

The output I would like to achieve should look as follows:

   startdate    enddate length
1 1990-06-04 1990-06-06      3
2 1990-06-09 1990-06-12      4

Thank you for the help, I am willing to take any approach or suggestion.

CodePudding user response：

If you prefer tidyverse syntax you could do

library(dplyr) 

df %>% 
  mutate(run = cumsum(c(1, abs(diff(hotday))))) %>%
  filter(hotday == 1) %>%
  group_by(run) %>%
  summarize(startdate = first(dates), enddate = last(dates), length = n()) %>%
  ungroup() %>%
  select(-run) %>%
  filter(length >= 3)
#> # A tibble: 2 x 3
#>   startdate  enddate    length
#>   <date>     <date>      <int>
#> 1 1990-06-04 1990-06-06      3
#> 2 1990-06-09 1990-06-12      4

^{Created on 2022-09-30 with reprex v2.0.2}

CodePudding user response：

An option is using rle in base R or rleid from data.table - grouped by the run-length-id of the 'hotday' summarise by taking the first and last of the 'dates' where the number of rows is greater than 2 and 1 is present in hotday

library(data.table)
setDT(df)[, if(.N >2 && 1 %in% hotday) .(startdate = first(dates),
  enddate = last(dates), length = .N) ,.(grp = rleid(hotday))][,
    grp := NULL][]

-output

    startdate    enddate length
       <Date>     <Date>  <int>
1: 1990-06-04 1990-06-06      3
2: 1990-06-09 1990-06-12      4

CodePudding user response：

1) Use groupid from collapse to assign a unique number to each run and then for each such run use filter to keep it only if it represents a run of hotday values and has a run length of at least 3. Then form the desired summary data and at the end remove the groupid.

library(dplyr)
library(collapse)

df %>%
  group_by(groupid = groupid(hotday)) %>%
  filter(first(hotday) == 1 && n() >= 3) %>%
  summarize(start = first(dates), end = last(dates), length = n()) %>%
  ungroup %>%
  select(-groupid)
## # A tibble: 2 × 3
##   start      end        length
##   <date>     <date>      <int>
## 1 1990-06-04 1990-06-06      3
## 2 1990-06-09 1990-06-12      4

2) A base R approach that follows the same approach as (1) is shown below. g corresponds to groupid there.

stats <- function(x) data.frame(start = x[1], end = tail(x, 1), length = length(x))

df |>
  transform(g = with(rle(hotday), rep(seq_along(lengths), lengths))) |>
  subset(hotday == 1 & ave(g, g, FUN = length) >= 3) |>
  with(by(dates, g, stats)) |>
  do.call(what = "rbind")
##        start        end length
## 4 1990-06-04 1990-06-06      3
## 6 1990-06-09 1990-06-12      4

3) A completely different approach using only base R is to convert hotday to a character string and use gregexpr to find the starts and lengths of the 1 runs.

g <- gregexpr("1{3,}", paste(df$hotday, collapse = ""))[[1]]
len <- attr(g, "match.length")

with(df, data.frame(start = dates[g], end = dates[g len-1], lengths = len))
##        start        end lengths
## 1 1990-06-04 1990-06-06       3
## 2 1990-06-09 1990-06-12       4

4) Another approach is based on the idea that the number of 0's prior to each run of 1's is the same for each element of a particular run of 1's. This approach uses only base R.

df |>
  transform(g = cumsum(hotday == 0)) |>
  subset(hotday == 1) |>
  with(data.frame(start = dates[match(unique(g), g)], 
                  end = dates[findInterval(unique(g), g)],
                  len = c(table(g)))) |>
  subset(len >= 3)
##        start        end len
## 2 1990-06-04 1990-06-06   3
## 4 1990-06-09 1990-06-12   4

4a) The same idea could be used in (1) to eliminate the need for the groupid function.

df %>%
  group_by(groupid = cumsum(hotday == 0)) %>%
  filter(hotday == 1) %>%
  filter(n() >= 3) %>%
  summarize(start = first(dates), end = last(dates), length = n()) %>%
  ungroup %>%
  select(-groupid)