Get row when a value first exceeds a threshold and then remains above it-CodePudding

I have the following dataframe:

ID Date        Value
1  2010-08-01  6
1  2011-05-01  8
1  2011-12-01  7
1  2012-08-01  6
1  2013-01-01  6
1  2014-04-01  10
1  2014-08-01  8
1  2015-01-01  9
1  2016-01-01  9
1  2017-01-01  8
2  1996-01-01  5
2  1998-01-01  8
2  2010-01-01  5
2  2014-08-01  8
2  2016-01-01  8

I want to detect, for each ID, the first row that Value is equal to or exceeds a set threshold (in this case: 8) and remains over this threshold for the following observations. So, based on the data frame, the result I would like to obtain would be the following:

ID Date       Value 
1 2014-04-01  10
2 2014-08-01  8

I truly appreciate your help! Thanks! Let me know if you need any additional information or clarification

CodePudding user response：

Not clear about the remains case - i.e. suppose an ID have 'Value' less than 8 for the last element, then it is not clear whether to return row for that ID or not. The below solution will skip those 'ID' (if exists)

library(dplyr)
library(data.table)
df1 %>% 
  group_by(ID) %>%
   slice({ind = Value >= 8
         grp = rleid(ind)
        if(all(ind[grp == max(grp)])) 
      first(row_number()[grp == max(grp)]) else 0
  }) %>% 
ungroup

-output

# A tibble: 2 × 3
     ID Date       Value
  <int> <chr>      <int>
1     1 2014-04-01    10
2     2 2014-08-01     8

data

df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L, 2L, 2L), Date = c("2010-08-01", "2011-05-01", "2011-12-01", 
"2012-08-01", "2013-01-01", "2014-04-01", "2014-08-01", "2015-01-01", 
"2016-01-01", "2017-01-01", "1996-01-01", "1998-01-01", "2010-01-01", 
"2014-08-01", "2016-01-01"), Value = c(6L, 8L, 7L, 6L, 6L, 10L, 
8L, 9L, 9L, 8L, 5L, 8L, 5L, 8L, 8L)), class = "data.frame", row.names = c(NA, 
-15L))

CodePudding user response：

A base R option using rle to check the starting point of exceeding the threshold and from which we always have the following values greater than the threshold

th <- 8

subset(
  df,
  ave(Value >= th,
    ID,
    FUN = function(v) {
      with(
        rle(v),
        cumsum(lengths)[length(lengths) - 1]   1 == seq_along(v) & values[length(lengths)]
      )
    }
  )
)

gives

   ID       Date Value
6   1 2014-04-01    10
14  2 2014-08-01     8

CodePudding user response：

Alt 1. Using rev to perform the check of values from the end of each group. Grab the index of the first FALSE (which.min), i.e. the last value below the threshold in the original order. Convert the "reversed" index to original indexing by subtracting it from group length .N.

library(data.table)
setDT(d)

d[ , {
  rv = rev(Value >= 8)
  .SD[.N - which.min(rv)   2]
  }, by = ID] 

#    ID       Date Value
# 1:  1 2014-04-01    10
# 2:  2 2014-08-01     8

Maybe add a check if values drop below 8 before the end, i.e. check the first element in the reversed vector:

d[ , {
  rv = rev(Value >= 8)
  if(!rv[1]) NA else {
    .SD[.N - which.min(rv)   2]
  }
}, by = ID]

Alt 2. Using run-length encoding:

d[ , if(Value[.N] >= 8){
  g = rleid(Value >= 8)
  .SD[g == max(g)][1]
  }, by = ID]