I have the following dataframe:
ID Date Value
1 2010-08-01 6
1 2011-05-01 8
1 2011-12-01 7
1 2012-08-01 6
1 2013-01-01 6
1 2014-04-01 10
1 2014-08-01 8
1 2015-01-01 9
1 2016-01-01 9
1 2017-01-01 8
2 1996-01-01 5
2 1998-01-01 8
2 2010-01-01 5
2 2014-08-01 8
2 2016-01-01 8
I want to detect, for each ID
, the first row that Value
is equal to or exceeds a set threshold (in this case: 8) and remains over this threshold for the following observations. So, based on the data frame, the result I would like to obtain would be the following:
ID Date Value
1 2014-04-01 10
2 2014-08-01 8
I truly appreciate your help! Thanks! Let me know if you need any additional information or clarification
CodePudding user response:
Not clear about the remains
case - i.e. suppose an ID have 'Value' less than 8 for the last element, then it is not clear whether to return row for that ID or not. The below solution will skip those 'ID' (if exists)
library(dplyr)
library(data.table)
df1 %>%
group_by(ID) %>%
slice({ind = Value >= 8
grp = rleid(ind)
if(all(ind[grp == max(grp)]))
first(row_number()[grp == max(grp)]) else 0
}) %>%
ungroup
-output
# A tibble: 2 × 3
ID Date Value
<int> <chr> <int>
1 1 2014-04-01 10
2 2 2014-08-01 8
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L), Date = c("2010-08-01", "2011-05-01", "2011-12-01",
"2012-08-01", "2013-01-01", "2014-04-01", "2014-08-01", "2015-01-01",
"2016-01-01", "2017-01-01", "1996-01-01", "1998-01-01", "2010-01-01",
"2014-08-01", "2016-01-01"), Value = c(6L, 8L, 7L, 6L, 6L, 10L,
8L, 9L, 9L, 8L, 5L, 8L, 5L, 8L, 8L)), class = "data.frame", row.names = c(NA,
-15L))
CodePudding user response:
A base R option using rle
to check the starting point of exceeding the threshold and from which we always have the following values greater than the threshold
th <- 8
subset(
df,
ave(Value >= th,
ID,
FUN = function(v) {
with(
rle(v),
cumsum(lengths)[length(lengths) - 1] 1 == seq_along(v) & values[length(lengths)]
)
}
)
)
gives
ID Date Value
6 1 2014-04-01 10
14 2 2014-08-01 8
CodePudding user response:
Alt 1. Using rev
to perform the check of values from the end of each group. Grab the index of the first FALSE
(which.min
), i.e. the last value below the threshold in the original order. Convert the "reversed" index to original indexing by subtracting it from group length .N
.
library(data.table)
setDT(d)
d[ , {
rv = rev(Value >= 8)
.SD[.N - which.min(rv) 2]
}, by = ID]
# ID Date Value
# 1: 1 2014-04-01 10
# 2: 2 2014-08-01 8
Maybe add a check if values drop below 8 before the end, i.e. check the first element in the reversed vector:
d[ , {
rv = rev(Value >= 8)
if(!rv[1]) NA else {
.SD[.N - which.min(rv) 2]
}
}, by = ID]
Alt 2. Using run-length encoding:
d[ , if(Value[.N] >= 8){
g = rleid(Value >= 8)
.SD[g == max(g)][1]
}, by = ID]