How do I drop all observations except the last of a pattern?

Time:10-24

I asked a question a few months back about how to identify and keep only observations that follow a certain pattern: How can I identify patterns over several rows in a column and fill a new column with information about that pattern using R?

I want to take this a step further. In that question I just wanted to identify the pattern. Now, if the pattern appears several times within a group, how do I keep only the last occurrence of that pattern? For example, given df1, how can I achieve df2?

df1

TIME        ID        D
12:30:10    2         0
12:30:42    2         0
12:30:59    2         1
12:31:20    2         0
12:31:50    2         0
12:32:11    2         0
12:32:45    2         1
12:33:10    2         1
12:33:33    2         1
12:33:55    2         1
12:34:15    2         0
12:34:30    2         0
12:35:30    2         0
12:36:30    2         0
12:36:45    2         0
12:37:00    2         0
12:38:00    2         1

I want to end up with the following df2

df2

TIME        ID        D
12:33:55    2         1
12:34:15    2         0
12:34:30    2         0
12:35:30    2         0
12:36:30    2         0
12:36:45    2         0
12:37:00    2         0
12:38:00    2         1

Thoughts? There were some helpful answers in the question I linked above, but I now want to narrow the result down to the last occurrence.
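To make the example reproducible, df1 can be rebuilt as a data frame from the table above. The column types are an assumption (TIME as character, ID and D as integers), since the question only shows printed output.

```r
# Rebuild the question's df1; values are copied from the printed table.
# Assumption: TIME is stored as character, ID and D as integers.
df1 <- data.frame(
  TIME = c("12:30:10", "12:30:42", "12:30:59", "12:31:20", "12:31:50",
           "12:32:11", "12:32:45", "12:33:10", "12:33:33", "12:33:55",
           "12:34:15", "12:34:30", "12:35:30", "12:36:30", "12:36:45",
           "12:37:00", "12:38:00"),
  ID = 2L,
  D = c(0L, 0L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L)
)
```

The desired df2 then corresponds to rows 10 through 17 of this data frame.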

CodePudding user response:

Here is a base R function that I find more complicated than I would like, but it returns what is asked for.
If I understood the pattern correctly, it doesn't matter if the last sequence ends in a 1 or a 0. The test with df1b has a last sequence ending in a 0.

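keep_last_pattern <- function(data, col){
  x <- data[[col]]
  # treat a trailing 0 as a closing 1 so the last sequence is terminated
  if(x[length(x)] == 0) x[length(x)] <- 1
  # flag each 1 that is followed by at least one 0, together with those 0s
  # (the \(y) lambda shorthand needs R >= 4.1)
  i <- ave(x, cumsum(x), FUN = \(y) y[1] == 1 & length(y) > 1)
  r <- rle(i)
  l <- length(r$lengths)
  n <- which(as.logical(r$values))
  # unflag every flagged run except the last one
  r$values[ n[-length(n)] ] <- 0
  # also keep a final unflagged run of length 1 (the closing row)
  r$values[l] <- r$lengths[l] == 1 && r$values[l] == 0
  j <- as.logical(inverse.rle(r))
  data[j, ]
}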

keep_last_pattern(df1, "D")

df1b <- df1
df1b[17, "D"] <- 0
keep_last_pattern(df1b, "D")
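As a sketch of what happens inside the function above, the intermediate values can be inspected on the D column of df1. This is my reading of the code, not part of the original answer; function(y) is used instead of \(y) so it also runs on R versions before 4.1.

```r
# D column of the question's df1
x <- c(0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1)

# cumsum(x) starts a new group at every 1; the 0s that follow stay in that group
grp <- cumsum(x)

# a group is flagged when it starts with a 1 and contains at least one more row
i <- ave(x, grp, FUN = function(y) y[1] == 1 & length(y) > 1)

# run-length encoding of the flags
r <- rle(i)
r$lengths  # 2 4 3 7 1
r$values   # 0 1 0 1 0
```

The last flagged run (length 7, rows 10 to 16) plus the single closing row 17 is what the function keeps.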

CodePudding user response:

Do you want the rows in each ID between the second-to-last 1 and the last 1?

Here is a function that does that; it can be applied to each ID.

library(dplyr)

extract_sequence <- function(x) {
  inds <- which(x == 1)
  inds[length(inds) - 1]:inds[length(inds)]
}

df1 %>%
  group_by(ID) %>%
  slice(extract_sequence(D)) %>%
  ungroup()

#  TIME        ID     D
#  <chr>    <int> <int>
#1 12:33:55     2     1
#2 12:34:15     2     0
#3 12:34:30     2     0
#4 12:35:30     2     0
#5 12:36:30     2     0
#6 12:36:45     2     0
#7 12:37:00     2     0
#8 12:38:00     2     1
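One caveat: extract_sequence() errors when a group contains fewer than two 1s, because the `:` operator receives a zero-length argument. Below is a minimal sketch of a guarded variant. The name extract_sequence_safe, the toy data, and the choice to return no rows for such groups are all assumptions for illustration. It is applied with base R here so that it runs without dplyr.

```r
# Hypothetical helper: positions from the second-to-last 1 to the last 1.
# Assumed policy: return no positions when there are fewer than two 1s.
extract_sequence_safe <- function(x) {
  inds <- which(x == 1)
  n <- length(inds)
  if (n < 2) return(integer(0))
  seq(inds[n - 1], inds[n])
}

# Made-up data: ID 1 has three 1s, ID 2 has only one
toy <- data.frame(
  ID = c(1, 1, 1, 1, 1, 2, 2, 2),
  D  = c(1, 0, 1, 0, 1, 0, 1, 0)
)

# Apply the helper per ID and stack the pieces back together
pieces <- lapply(split(toy, toy$ID), function(g) g[extract_sequence_safe(g$D), ])
result <- do.call(rbind, pieces)
```

Here result holds rows 3 to 5 of ID 1, and ID 2 contributes nothing. The same helper should also work inside slice() in the dplyr pipeline above, since slice() accepts an empty index vector.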

CodePudding user response:

Not sure this will help as it's unclear what your pattern is. Let's assume you have data like this, with one column indicating in some way whether the row matches a pattern or not:

set.seed(123)
df <- data.frame(
  grp = sample(LETTERS[1:3], 10, replace = TRUE),
  x = 1:10,
  y = c(0,1,0,0,1,1,1,1,1,1),
  pattern = rep(c("TRUE", "FALSE"),5)
)

If the aim is to keep only the last occurrence of pattern == "TRUE" per group, this might work:

library(dplyr)

df %>%
  filter(pattern == "TRUE") %>%
  group_by(grp) %>%
  slice_tail(n = 1)
# A tibble: 3 x 4
# Groups:   grp [3]
  grp       x     y pattern
  <chr> <int> <dbl> <chr>  
1 A         1     0 TRUE   
2 B         9     1 TRUE   
3 C         5     1 TRUE 