I would like to know if there is a way in r to return a value for the number of times where a series of data exceeds a certain value for a number of consecutive days.
e.g. How many times in a year was x greater than 10 for at least 30 consecutive days?
I know that you can find how many instances x was greater than a certain value over the whole year, but I'm not sure how to test for instances that are consecutive.
Where Data
is a data.frame with Date, Year, and Value columns with daily data from 2010-2020:
Data %>%
group_by(Year) %>%
filter(Value >= 10) %>%
summarize(exceedances = n())
Here is an example of daily data from 2018-2021 with random values from 0-25:
library(tidyverse)
library(dplyr)
library(lubridate)
value = sample(0:25, 1461, replace=T)
date = seq(as.Date("2018-01-01"), as.Date("2021-12-31"), by = "1 day")
dat = data.frame(date = date,
year = year(date),
value = value)
dat %>%
group_by(year) %>%
filter(value >= 10) %>%
summarize(exceedances = n())
The output:
# A tibble: 4 x 2
year exceedances
<dbl> <int>
1 2018 216
2 2019 247
3 2020 229
4 2021 217
Desired output (n of >= 30 consecutive exceedances is a guess):
# A tibble: 4 x 2
year n_exceedances_30_consec
<dbl> <int>
1 2018 1
2 2019 0
3 2020 2
4 2021 0
The trick with this is that if there are 40 consecutive exceedances, I need that to show as 1 instance only, not 10 instances where the previous 30 days were >= 10.
CodePudding user response:
You could use slider::slide_dbl
. Maybe a complex way of doing it, but you could
library(tidyverse)
library(lubridate)
library(slider)
value = sample(0:25, 1461, replace=T)
date = seq(as.Date("2018-01-01"), as.Date("2021-12-31"), by = "1 day")
dat = data.frame(date = date,
year = year(date),
value = value)
library(dplyr)
library(slider)
consecutive_days <- 10
dat |>
mutate(greater = if_else(value >= 10, TRUE, FALSE)) |>
mutate(consecutive = slide_dbl(greater, sum, .before = consecutive_days-1)) |>
filter(consecutive >= consecutive_days) |>
filter(greater) |>
group_by(year) |>
summarize(exceedances = n())
I'm obviously freestyling here as there's no data provided. Hopefully this fits your needs!
Edit - I'm editing in light of the data you sent. This should now work correctly.
This is if you're more of a base R user:
date_index <- sapply(dat$date, \(x) {
d <- dat[dat$date >= x-(consecutive_days-1) &
dat$date <= x, ]
d <- d[d$value >= 10, ]
nrow(d) >= consecutive_days
})
aggregate(x = dat[date_index, ]$value,
by = list(year = dat[date_index, ]$year),
FUN = length)
CodePudding user response:
I'm changing your data slightly as I wasn't getting any runs long enough.
set.seed(6)
date = seq(as.Date("2018-01-01"), as.Date("2021-12-31"), by = "1 day")
dat = data.frame(date = date,
year = format(date, "%Y"),
value = sample(5:50, length(date), replace=TRUE)
)
year_runs <- sapply(unique(dat$year), function(y) {
runs <- rle(dat$value[dat$year==y] >= 10); sum(runs$length>=30 & runs$value)
})
year_runs
2018 2019 2020 2021
1 2 2 2
The secret here is the base function rle
(run length encoding). This is running separately by year; if a run is over the new year, it will be truncated.
In response to your edit that you want to count the run in the year it ends, you can then skip the sapply and use rle
on the whole dataset, remove the runs that are too short, then invert it back and find the ends
runs <- rle(dat$value >= 10)
runs$values [runs$lengths < 30] <- FALSE
run_ends <- which(diff(inverse.rle(runs)) == -1) 1
table(dat$year[run_ends])
2018 2019 2020 2021
1 2 2 2