Home > front end >  How to filter out rows that follow a certain established threshold being reached in R?
How to filter out rows that follow a certain established threshold being reached in R?

Time:05-06

I have the following data:

group <- rep(letters[seq(from = 1, to = 3)], each = 4)
date <- c("1999-01-01", "1999-01-02", "1999-10-01", "1999-10-05",
           "1988-02-01", "1997-12-25", "1997-12-26", "1998-01-01",
           "2000-05-01", "2000-07-01", "2000-12-01", "2000-12-02")
day <- c(1,2,274,278,
         1,3616,3617,3623,
         1, 62,215,216)
diff <- c(0, 1, 272, 4, 
          0, 3615, 1, 6, 
          0, 61, 153, 1)
matrix <- matrix(c(group, date, day, diff), ncol = 4, byrow = F)
df <- as.data.frame(matrix)
colnames(df) <- c("group", "date", "day", "diff")
df

In this case, "diff" is the difference between consecutive dates, in days, by group. I am trying to filter out all rows after an arbitrary threshold of "diff" has been reached. For example, let's say this threshold is a difference of 100 days. I would want to eliminate all rows on and after the first value of "diff" that is greater than 100, by group. In other words, my output would look as follows:

group2 <- c("a", "a", 
           "b", 
           "c", "c")
date2 <- c("1999-01-01", "1999-01-02",
          "1988-02-01   ",
          "2000-05-01", "2000-07-01")
day2 <- c(1, 2,
         1,
         1, 62)
diff2 <- c(0,1,
           0,
           0,61)

matrix2 <- matrix(c(group2, date2, day2, diff2), ncol = 4, byrow = F)
df2 <- as.data.frame(matrix2)
colnames(df2) <- c("group", "date", "day", "diff")
df2

Is there some way to get this output? There are similar questions on Stack Overflow, but they do not accommodate groups, or do not work for my data. Filtering any value of "diff" less than 100 is not the solution, as it leaves me with dates that occurred AFTER the 100 day gap, which I do not want.

df %>%
  filter(diff < 100)

Again, I just want to find the first instance where diff > 100 and remove this row and all subsequent rows for that group. Any help here would be appreciated.

CodePudding user response:

df = data.frame(group, date, day, diff)

library(dplyr)
df %>%
  group_by(group) %>%
  filter(cumsum(diff > 100) == 0)
# # A tibble: 5 × 4
# # Groups:   group [3]
#   group date         day  diff
#   <chr> <chr>      <dbl> <dbl>
# 1 a     1999-01-01     1     0
# 2 a     1999-01-02     2     1
# 3 b     1988-02-01     1     0
# 4 c     2000-05-01     1     0
# 5 c     2000-07-01    62    61
  •  Tags:  
  • r
  • Related