How do I drop all observations except the last of a pattern?

I asked a question a few months back about how to identify and keep only observations that follow a certain pattern: How can I identify patterns over several rows in a column and fill a new column with information about that pattern using R?

I want to take this a step further. In that question I just wanted to identify the pattern. Now, if the pattern appears several times within a group, how do I keep only the last occurrence of it? For example, given df1, how can I achieve df2?

df1

TIME        ID        D
12:30:10    2         0
12:30:42    2         0
12:30:59    2         1
12:31:20    2         0
12:31:50    2         0
12:32:11    2         0
12:32:45    2         1
12:33:10    2         1
12:33:33    2         1
12:33:55    2         1
12:34:15    2         0
12:34:30    2         0
12:35:30    2         0
12:36:30    2         0
12:36:45    2         0
12:37:00    2         0
12:38:00    2         1

I want to end up with the following df2

df2

TIME        ID        D
12:33:55    2         1
12:34:15    2         0
12:34:30    2         0
12:35:30    2         0
12:36:30    2         0
12:36:45    2         0
12:37:00    2         0
12:38:00    2         1

Thoughts? There were some helpful answers in the question I linked above, but I now want to narrow it.
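
In case it helps to reproduce this, here is a sketch of df1 as R code. The column types are an assumption (TIME as character, ID and D as integers), and df2 is built by row position only to show the target, not as a solution.

```r
# Sketch of df1 from the table above; column types are assumed
df1 <- data.frame(
  TIME = c("12:30:10", "12:30:42", "12:30:59", "12:31:20", "12:31:50",
           "12:32:11", "12:32:45", "12:33:10", "12:33:33", "12:33:55",
           "12:34:15", "12:34:30", "12:35:30", "12:36:30", "12:36:45",
           "12:37:00", "12:38:00"),
  ID = 2L,
  D = c(0L, 0L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L)
)

# The desired result corresponds to rows 10 to 17 of df1
df2 <- df1[10:17, ]
```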

CodePudding user response：

Here is a base R function that I find too complicated, but it returns what is asked for.
If I understood the pattern correctly, it doesn't matter whether the last sequence ends in a 1 or a 0. The test with df1b has a last sequence ending in a 0.

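keep_last_pattern <- function(data, col){
  x <- data[[col]]
  # treat a trailing 0 as if it closed the last sequence
  if(x[length(x)] == 0) x[length(x)] <- 1
  # flag runs that start with a 1 and are followed by at least one 0
  i <- ave(x, cumsum(x), FUN = \(y) y[1] == 1 & length(y) > 1)
  r <- rle(i)
  l <- length(r$lengths)
  n <- which(as.logical(r$values))
  # unflag every flagged run except the last one
  r$values[ n[-length(n)] ] <- 0
  # also keep the single closing row that follows the last run
  r$values[l] <- r$lengths[l] == 1 && r$values[l] == 0
  j <- as.logical(inverse.rle(r))
  # return only the flagged rows
  data[j, ]
}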

keep_last_pattern(df1, "D")

df1b <- df1
df1b[17, "D"] <- 0
keep_last_pattern(df1b, "D")
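
A shorter alternative might look like the sketch below. It is not the function above but a different approach, and it assumes the pattern is "a 1, at least one 0, then the next 1 (or the end of the data)". The name last_pattern_rows is made up for this illustration.

```r
# Hypothetical helper: row positions of the last run of the form
# 1, 0, ..., 0, 1 in a 0/1 vector. Assumes the pattern needs at least one 0.
last_pattern_rows <- function(x) {
  n <- length(x)
  # positions of a 1 that is immediately followed by a 0
  starts <- which(x[-n] == 1 & x[-1] == 0)
  if (length(starts) == 0) return(integer(0))
  start <- starts[length(starts)]
  # the first 1 after the start closes the run; if there is none,
  # the run extends to the end of the vector
  ones <- which(x == 1)
  ones <- ones[ones > start]
  end <- if (length(ones) > 0) ones[1] else n
  start:end
}

# D column from the question
D <- c(0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1)
last_pattern_rows(D)
# [1] 10 11 12 13 14 15 16 17
```

With a data frame, the result can be used as a row index, for example df1[last_pattern_rows(df1$D), ].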

CodePudding user response：

Do you want the rows in each ID between the second-to-last 1 and the last 1?

Here is a function that does that and can be applied to each ID.

library(dplyr)

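extract_sequence <- function(x) {
  # positions of all 1s in the group
  inds <- which(x == 1)
  # fewer than two 1s: no complete sequence, keep no rows
  if (length(inds) < 2) return(integer(0))
  # row positions from the second-to-last 1 through the last 1
  inds[length(inds) - 1]:inds[length(inds)]
}

df1 %>%
  group_by(ID) %>%
  slice(extract_sequence(D)) %>%
  ungroup()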

#  TIME        ID     D
#  <chr>    <int> <int>
#1 12:33:55     2     1
#2 12:34:15     2     0
#3 12:34:30     2     0
#4 12:35:30     2     0
#5 12:36:30     2     0
#6 12:36:45     2     0
#7 12:37:00     2     0
#8 12:38:00     2     1
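
To see what the helper returns on its own, here is a small sketch that calls it on the D column from the question, using the same core logic as above without any input checks. The second call uses a made-up vector to show a caveat: because the rule is "second-to-last 1 to last 1", two consecutive 1s at the end give a two-row result with no 0s in between.

```r
# Same core logic as the helper in this answer
extract_sequence <- function(x) {
  inds <- which(x == 1)
  inds[length(inds) - 1]:inds[length(inds)]
}

# D column from the question
D <- c(0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1)
extract_sequence(D)
# [1] 10 11 12 13 14 15 16 17

# Illustrative vector (not from the question) ending in two 1s
extract_sequence(c(1, 0, 0, 1, 1))
# [1] 4 5
```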

CodePudding user response：

Not sure this will help as it's unclear what your pattern is. Let's assume you have data like this, with one column indicating in some way whether the row matches a pattern or not:

set.seed(123)
df <- data.frame(
  grp = sample(LETTERS[1:3], 10, replace = TRUE),
  x = 1:10,
  y = c(0,1,0,0,1,1,1,1,1,1),
  pattern = rep(c("TRUE", "FALSE"),5)
)

If the aim is to keep only the last occurrence of pattern == "TRUE" per group, this might work:

library(dplyr)

df %>%
  filter(pattern == "TRUE") %>%
  group_by(grp) %>%
  slice_tail(n = 1)
# A tibble: 3 x 4
# Groups:   grp [3]
  grp       x     y pattern
  <chr> <int> <dbl> <chr>  
1 A         1     0 TRUE   
2 B         9     1 TRUE   
3 C         5     1 TRUE