I have a dataframe that represents a two-year daily time series of temperature for one river. For this river, I would like to know what day of year (doy
):
- temperature is sustained greater than or equal to 10 degrees
- sustained is when there are no more dips below 10 until after the peak temperature of the year, such as during autumn or winter
- temperature is sustained less than or equal to 10 degrees
- sustained is when there are no more peaks above 10 until the following year
I run into errors when I try to calculate 2 because there are multiple TRUE
answers for the code to choose from. I would like to know how I can make the code go with the first TRUE
answer if there are multiple TRUE
answers.
Example Dataset
library(ggplot2)
library(lubridate)
library(dplyr)
library(dataRetrieval)
siteNumber <- "01417500"
parameterCd <- "00010" # water temperature
statCd <- "00003" # mean
startDate <- "2015-01-01"
endDate <- "2016-12-31"
dat <- readNWISdv(siteNumber, parameterCd, startDate, endDate, statCd=statCd)
dat <- dat[,c(2:4)]
colnames(dat)[3] <- "temperature"
# Visually inspect the time series
ggplot(data = dat, aes(x = Date, y = temperature))
geom_point()
theme_bw()
Code for 1 & 2 where 2 is having issues because there are multiple TRUE
statements to choose from
dat %>%
mutate(year = year(Date),
doy = yday(Date)) %>%
group_by(year) %>%
mutate(gt_10 = temperature >= 10, # greater than or equal to 10 degrees
lt_10 = temperature <= 10, # less than or equal to 10 degrees
peak_doy = doy[which.max(temperature)], # what doy is max temperature
below_peak = doy < peak_doy, # is the observed doy less than the peak temperature doy
after_peak = doy > peak_doy, # is the observed doy greater than the peak temperature doy
test_above = ave(gt_10, cumsum(!gt_10), FUN = cumsum), # counts number of days above 10 degree threshold
test_below = ave(lt_10, cumsum(!lt_10), FUN = cumsum)) %>% # counts number of days below 10 degree threshold
summarise(first_above_10_sustained = doy[below_peak == T & test_above == 14]-13, # answer to 1
first_below_10_sustained = doy[after_peak == T & test_below == 14]-13) # answer to 2
- To answer 2, the code is looking at those times when temperature is after the annual peak temperature (i.e.,
after_peak == T
) and when temperature has been below the 10 threshold for 14 consecutive days (i.e.,test_below == 14
). Thetest_below == 14
is where the error lies because there are multiple times when this occurs. Yes, you could change the threshold of consecutive days to some value > 14 but that is beside the point. How can I get the code to accept the firstTRUE
answer is there are multipleTRUE
answers?
I have a similar SO question here but my answer only works when there is not multiple TRUE
answers to choose from.
CodePudding user response:
There are a couple of tricks I would employ here:
- Create a column that specifies whether the temperature is above or below 10 degrees.
- Get the
rleid
of this column, which will group all of the consecutive days above or below the threshold of 10 degrees. - Find the maximum temperature of each year, and store it in a column.
- The
rleid
which contains that maximum temperature will be the dates where temperature is sustained > 10 degrees for that year according to your definition - Do the same thing with minimum temperatures, but subtract 6 months when calculating the year to group by when calculating the minimum for that year. This will not make a difference to the end result but allows a calculation of the minimum temperature in the winter period:
df <- dat %>%
mutate(year = year(Date)) %>%
group_by(year) %>%
mutate(max_temp = max(temperature)) %>%
ungroup() %>%
mutate(above_ten = temperature >= 10,
run = factor(data.table::rleid(above_ten))) %>%
group_by(run) %>%
mutate(sustained_hi = max(temperature) == max(max_temp)) %>%
ungroup() %>%
mutate(year = year(Date - months(6))) %>%
group_by(year) %>%
mutate(min_temp = min(temperature)) %>%
group_by(run) %>%
mutate(sustained_lo = min(temperature) == min(min_temp)) %>%
mutate(group = ifelse(sustained_hi, 'High',
ifelse(sustained_lo, 'Low',
'Unsustained'))) %>%
select(site_no, Date, temperature, group, run)
This results in:
df
#> # A tibble: 731 x 5
#> # Groups: run [27]
#> site_no Date temperature group run
#> <chr> <date> <dbl> <chr> <fct>
#> 1 01417500 2015-01-01 0.7 Low 1
#> 2 01417500 2015-01-02 1.1 Low 1
#> 3 01417500 2015-01-03 1 Low 1
#> 4 01417500 2015-01-04 2.5 Low 1
#> 5 01417500 2015-01-05 2 Low 1
#> 6 01417500 2015-01-06 0.3 Low 1
#> 7 01417500 2015-01-07 0.2 Low 1
#> 8 01417500 2015-01-08 0.2 Low 1
#> 9 01417500 2015-01-09 0.3 Low 1
#> 10 01417500 2015-01-10 0.3 Low 1
#> # ... with 721 more rows
#> # i Use `print(n = ...)` to see more rows
And we can see the result by plotting like this:
ggplot(df, aes(x = Date, y = temperature, color = group))
geom_point()
scale_color_manual(limits = c('High', 'Unsustained', 'Low'),
values = c('orange', 'gray', 'steelblue'))
theme_bw()
And we can get a nice little data frame of the start and end dates of our sustained high / low temperatures by doing:
df %>%
filter(group != 'Unsustained') %>%
group_by(run) %>%
summarize(Date = c(first(Date), last(Date)),
Event = paste('Sustained', first(group), c('Start', 'End'))) %>%
ungroup() %>%
select(-run)
#> # A tibble: 10 x 2
#> Date Event
#> <date> <chr>
#> 1 2015-01-01 Sustained Low Start
#> 2 2015-04-28 Sustained Low End
#> 3 2015-04-29 Sustained High Start
#> 4 2015-07-16 Sustained High End
#> 5 2015-11-08 Sustained Low Start
#> 6 2016-03-31 Sustained Low End
#> 7 2016-05-18 Sustained High Start
#> 8 2016-10-09 Sustained High End
#> 9 2016-10-23 Sustained Low Start
#> 10 2016-12-31 Sustained Low End