Which is first day above and below threshold?-CodePudding

I have a dataframe that represents a two year daily time series of temperature for two rivers. For each river and each year, I would like to know what day of year:

temperature is greater than or equal to 15 degrees
temperature is sustained greater than or equal to 15 degrees (sustained is when there are no more dips below 15 until the autumn)
temperature is less than or equal to 15 degrees
temperature is sustained less than or equal to 15 degrees (sustained is when there are are no more peaks above 15 until the following spring)

Example timeseries

library(ggplot2)
library(lubridate)
library(dplyr)
library(dataRetrieval)

siteNumber <- c("01428500","01432805") # United States Geological Survey site numbers
parameterCd <- "00010" # temperature
statCd <- "00003" # mean
startDate <- "2018-01-01"
endDate <- "2019-10-31"

dat <- readNWISdv(siteNumber, parameterCd, startDate, endDate, statCd=statCd) # obtains the timeseries from the USGS
dat <- dat[,c(2:4)]
colnames(dat)[3] <- "temperature"

# To view at the time series
ggplot(data = dat, aes(x = Date, y = temperature))  
  geom_point()  
  theme_bw()  
  facet_wrap(~site_no)

# Adds a new column for year and day of year (doy; Jan 1 = 1, Dec 31 = 365)
dat <- dat %>%
  mutate(year = year(Date),
         doy = yday(Date))

I have tried using the dplyr filter() function but have had little success

dat %>%
  group_by(site_no,year) %>%
  filter(temperature >= 15 & temperature <= 15)

The ideal output would look something like this:

   site_no year doy_firstabove15 doy_sustainedAbove15 doy_firstbelow15 doy_sustainedBelow15
1 01428500 2018              136                  144              253                  286
2 01428500 2019              140                  146              279                  289
3 01432805 2018              143                  143              272                  276
4 01432805 2019              140                  140              278                  278

CodePudding user response：

I like this question. Its specific, tricky, and has a good example. I would go about this by creating a rolling variable that checks for changes in temperature above and below 15. Then you can group by site and year and check for the first instance of the change in group id. Now I make several assumptions here that might be bad. 1) I am grouping by year, but it would likely be better to group by water year. You don't explicitly say that the year is important, but I suspect it is. 2) I also assume that Autumn is defined by the first day of month 9. Maybe you want the sustained to be through the fall instead of the first day. Same assumption for spring.

library(tidyverse)
library(dataRetrieval)

dat |>
  mutate(month = lubridate::month(Date),
         year = lubridate::year(Date)) |>
  group_by(site_no, year) |>
  mutate(gt_15 = temperature > 15,
         lt_15 = temperature < 15,
         chng1 = cumsum(gt_15 != lag(gt_15, def = first(gt_15))),
         chng2 = cumsum(lt_15 != lag(lt_15, def = first(lt_15))),
         )|>
  summarise(first_above_15 = Date[chng1==1][1],
            indx_autum = chng1[month == 9][1],
            first_above_15_sustained = Date[chng1== indx_autum][1],
            first_below_15 = Date[chng2==1][1],
            indx_spring = chng1[month == 3][1],
            first_below_15_sustained = Date[chng2== indx_spring][1],
            .groups = "drop") |>
  select(-contains("indx"))
#> # A tibble: 4 x 6
#>   site_no   year first_above_15 first_above_15_sustained first_belo~1 first_be~2
#>   <chr>    <dbl> <date>         <date>                   <date>       <date>    
#> 1 01428500  2018 2018-05-18     2018-05-24               2018-05-16   2018-01-01
#> 2 01428500  2019 2019-05-20     2019-05-26               2019-05-20   2019-01-01
#> 3 01432805  2018 2018-05-23     2018-05-23               2018-05-23   2018-01-01
#> 4 01432805  2019 2019-05-20     2019-05-20               2019-05-20   2019-02-22
#> # ... with abbreviated variable names 1: first_below_15,
#> #   2: first_below_15_sustained

CodePudding user response：

This works by splitting the annual time series into times before and after the peak temperature. Then it logically returns the day of year (doy) when temperature is first above 15 before the peak temperature and the first below 15 after the peak temperature.

Similarly, the sustained above 15 is determined logically by only looking at those doy when temperature is before the peak temperature and is above 15 for 7 consecutive days. The same logic is used for sustained below 15 by only looking at those doy when temperature is after the peak temperature and is below 15 for 7 consecutive days. Kinda hacky but it works.

dat %>%
  mutate(year = year(Date),
         doy = yday(Date)) %>%
  group_by(site_no, year) %>%
  mutate(gt_15 = temperature >= 15,
         lt_15 = temperature <= 15,
         peak_doy = doy[temperature == max(temperature)],
         below_peak = doy < peak_doy,
         after_peak = doy > peak_doy,
         test_above = ave(gt_15, cumsum(!gt_15), FUN = cumsum),
         test_below = ave(lt_15, cumsum(!lt_15), FUN = cumsum)) %>%
  summarise(first_above_15 = doy[below_peak == T & gt_15 == T][1],
            first_below_15 = doy[after_peak == T & lt_15 == T][1],
            first_above_15_sustained = doy[below_peak == T & test_above == 7]-6,
            first_below_15_sustained = doy[after_peak == T & test_below == 7]-6)