Identify first unique value when multiple conditions are met using dplyr in R

I have a dataframe representing a two-year daily time series of temperature for two rivers. For each site and year, I have flagged whether each day falls before or after the day of peak temperature (columns below_peak and after_peak). I have also created a run-length ID column (run) that increments each time the temperature crosses a threshold of 10 degrees.

How can I get the first day of year, for each site and year, that meets each of the following conditions?

maximum run ID & below_peak == TRUE
maximum run ID & after_peak == TRUE

Example Data:

library(ggplot2)
library(lubridate)
library(dplyr)
library(dataRetrieval)

siteNumber <- c("01432805","01388000") # United States Geological Survey site numbers
parameterCd <- "00010" # temperature
statCd <- "00003" # mean
startDate <- "1996-01-01"
endDate <- "1997-12-31"

dat <- readNWISdv(siteNumber, parameterCd, startDate, endDate, statCd=statCd) # obtains the timeseries from the USGS
dat <- dat[,c(2:4)]
colnames(dat)[3] <- "temperature"

# To view the time series
ggplot(data = dat, aes(x = Date, y = temperature)) +
  geom_point() +
  theme_bw() +
  facet_wrap(~site_no)

To create the columns described above:

dat <- dat %>%
  mutate(year = year(Date),
         doy = yday(Date)) %>% # doy = day of year
  group_by(site_no, year) %>%
  mutate(lt_10 = temperature <= 10,
         peak_doy = doy[which.max(temperature)],
         below_peak = doy < peak_doy,
         after_peak = doy > peak_doy,
         run = data.table::rleid(lt_10))
View(dat)

The ideal output would look as follows:

   site_no year doy_below doy_after
1 01388000 1996       111       317
2 01388000 1997       112       312
3 01432805 1996       137       315
4 01432805 1997       130       294

doy_after = the first row for after_peak == TRUE & max(run) when group_by(site_no,year)

doy_below = the first row for below_peak == TRUE & max(run) when group_by(site_no,year)

For site_no = 01388000 in year = 1996, the max(run) when below_peak == TRUE is 4. The first row where run == 4 and below_peak == TRUE corresponds to the date 1996-04-20, which has doy = 111.

CodePudding user response:

As the data is already grouped by site_no and year, you can go straight to summarise. Within each group, take the maximum 'run' among the rows where 'below_peak' (or 'after_peak') is TRUE, subset 'doy' to the rows with that run value, and keep the first element.

library(dplyr)
dat %>% 
 summarise(doy_below = first(doy[run == max(run[below_peak])]), 
           doy_above = first(doy[run == max(run[after_peak])]), .groups = 'drop')

Output:

# A tibble: 4 × 4
  site_no   year doy_below doy_above
  <chr>    <dbl>     <dbl>     <dbl>
1 01388000  1996       111       317
2 01388000  1997       112       312
3 01432805  1996       137       315
4 01432805  1997       130       294
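
To check the logic without downloading anything, here is a minimal sketch on made-up data. The toy temperatures, the site labels "A" and "B", the helper run_id() (a base-R stand-in for data.table::rleid()) and the extra column doy_above_strict are all assumptions for illustration and are not part of the original question or answer.

```r
library(dplyr)

# Base-R stand-in for data.table::rleid(): consecutive ID for each run of equal values
run_id <- function(x) {
  r <- rle(x)
  rep(seq_along(r$lengths), r$lengths)
}

# Hypothetical toy data (not USGS values): two sites, one "year" of 12 days each
toy <- data.frame(
  site_no     = rep(c("A", "B"), each = 12),
  year        = 1996,
  doy         = rep(1:12, times = 2),
  temperature = c(8, 11, 9, 9, 12, 15, 20, 14, 11, 9, 8, 7,      # A: drops to <= 10 again after the peak
                  8, 9, 12, 15, 20, 18, 16, 14, 13, 12, 11, 11)  # B: stays above 10 after the peak
)

res <- toy %>%
  group_by(site_no, year) %>%
  mutate(lt_10      = temperature <= 10,
         peak_doy   = doy[which.max(temperature)],
         below_peak = doy < peak_doy,
         after_peak = doy > peak_doy,
         run        = run_id(lt_10)) %>%
  summarise(doy_below        = first(doy[run == max(run[below_peak])]),
            doy_above        = first(doy[run == max(run[after_peak])]),
            # stricter variant: the row itself must also be after the peak
            doy_above_strict = first(doy[after_peak & run == max(run[after_peak])]),
            .groups = "drop")

print(res)
# site A: doy_below = 5, doy_above = 10, doy_above_strict = 10
# site B: doy_below = 3, doy_above = 3,  doy_above_strict = 6
```

For site A the two versions agree. For site B the last run starts before the peak and never ends, so run == max(run[after_peak]) alone returns a day before the peak; adding the after_peak condition inside the subset matches the wording "first row for after_peak == TRUE & max(run)" more literally. On the river data shown above the results may well be identical, so treat the stricter form as an optional safeguard.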