How to check number of instances where x threshold is exceeded for n consecutive days in a series of-CodePudding

I would like to know if there is a way in r to return a value for the number of times where a series of data exceeds a certain value for a number of consecutive days.

e.g. How many times in a year was x greater than 10 for at least 30 consecutive days?

I know that you can find how many instances x was greater than a certain value over the whole year, but I'm not sure how to test for instances that are consecutive.

Where Data is a data.frame with Date, Year, and Value columns with daily data from 2010-2020:

Data %>%
group_by(Year) %>%
filter(Value >= 10) %>%
summarize(exceedances = n())

Here is an example of daily data from 2018-2021 with random values from 0-25:

library(tidyverse)
library(dplyr)
library(lubridate)

value = sample(0:25, 1461, replace=T)
date = seq(as.Date("2018-01-01"), as.Date("2021-12-31"), by = "1 day")

dat = data.frame(date = date,
                 year = year(date),
                 value = value)

dat %>%
  group_by(year) %>%
  filter(value >= 10) %>%
  summarize(exceedances = n())

The output:

# A tibble: 4 x 2
   year exceedances
  <dbl>       <int>
1  2018         216
2  2019         247
3  2020         229
4  2021         217

Desired output (n of >= 30 consecutive exceedances is a guess):

# A tibble: 4 x 2
   year n_exceedances_30_consec
  <dbl>       <int>
1  2018         1
2  2019         0
3  2020         2
4  2021         0

The trick with this is that if there are 40 consecutive exceedances, I need that to show as 1 instance only, not 10 instances where the previous 30 days were >= 10.

CodePudding user response：

You could use slider::slide_dbl. Maybe a complex way of doing it, but you could

library(tidyverse)
library(lubridate)
library(slider)

value = sample(0:25, 1461, replace=T)
date = seq(as.Date("2018-01-01"), as.Date("2021-12-31"), by = "1 day")

dat = data.frame(date = date,
                 year = year(date),
                 value = value)
library(dplyr)
library(slider)

consecutive_days <- 10

dat |>
  mutate(greater = if_else(value >= 10, TRUE, FALSE)) |>
  mutate(consecutive = slide_dbl(greater, sum, .before = consecutive_days-1)) |>
  filter(consecutive >= consecutive_days) |>
  filter(greater) |> 
  group_by(year) |>
  summarize(exceedances = n())

I'm obviously freestyling here as there's no data provided. Hopefully this fits your needs!

Edit - I'm editing in light of the data you sent. This should now work correctly.

This is if you're more of a base R user:

date_index <- sapply(dat$date, \(x) {
  d <- dat[dat$date >= x-(consecutive_days-1) &
             dat$date <= x, ]
  d <- d[d$value >= 10, ]
  nrow(d) >= consecutive_days
})

aggregate(x = dat[date_index, ]$value,
          by = list(year = dat[date_index, ]$year),
          FUN = length)

CodePudding user response：

I'm changing your data slightly as I wasn't getting any runs long enough.

set.seed(6)
date = seq(as.Date("2018-01-01"), as.Date("2021-12-31"), by = "1 day")
dat = data.frame(date = date,
                 year = format(date, "%Y"),
                 value = sample(5:50, length(date), replace=TRUE)
)
year_runs <- sapply(unique(dat$year), function(y) {
  runs <- rle(dat$value[dat$year==y] >= 10); sum(runs$length>=30 & runs$value)
})
year_runs
2018 2019 2020 2021 
   1    2    2    2

The secret here is the base function rle (run length encoding). This is running separately by year; if a run is over the new year, it will be truncated.

In response to your edit that you want to count the run in the year it ends, you can then skip the sapply and use rle on the whole dataset, remove the runs that are too short, then invert it back and find the ends

runs <- rle(dat$value >= 10)
runs$values [runs$lengths < 30] <- FALSE
run_ends <- which(diff(inverse.rle(runs)) == -1)   1
table(dat$year[run_ends])
2018 2019 2020 2021 
   1    2    2    2