R: Removing All Rows After Condition Is Met

I am working with the R programming language.

I have the following dataset:

id = c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3)
col1 = c(0,0,1,1,0,0,1,0,0,1,1,0,1,0,1,0)
col2 = c("A", "B", "A","A", "B", "A","A", "B", "A","A", "B", "A","A", "B", "A", "B")
my_data = data.frame(id, col1, col2)
my_data$row_num = 1:nrow(my_data)

For each unique id, I want to keep the rows up to and including the first row where col1 == 1 OR col2 == "A", and delete all rows in that group that occur AFTER it (i.e. keep the first occurrence).

I found this question (How to Filter out Rows per Group after Condition Occurs), in which an answer to a similar question is provided. I tried to adapt that answer to my problem:

library(dplyr) 

my_data %>%
    group_by(id) %>% 
    slice(seq_len(which((col1 == 1) | (col2 == "A"))[1]))

Can someone please confirm if I have done this correctly? I am not sure if I have correctly inserted the OR statement within the "slice" function.

Thanks!

CodePudding user response：

Not sure if there's some tidy magic that can do the job, but here's the "dumb" approach: partition the dataset by ID and loop through each of the parts:

filtered_data <- data.frame(matrix(NA, nrow=0, ncol=4))
colnames(filtered_data) <- colnames(my_data)

rows_added <- 0
for(id in unique(my_data$id)) {
  relevant_data <- my_data[my_data$id == id,]
  # Loop over the rows of this group only, not the whole dataset
  for(row in seq_len(nrow(relevant_data))) {
      rows_added <- rows_added + 1
      filtered_data[rows_added,] <- relevant_data[row,]
      jump_condition <- relevant_data[row, "col1"] == 1 | relevant_data[row, "col2"] == "A"
      if(jump_condition) {
        break
      }
  }
}
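
The same partition-and-trim idea can probably be written without explicit loops. Below is a minimal base R sketch, not a definitive implementation: keep_until_hit() is a hypothetical helper name, and keeping the whole group when the condition never occurs is my assumption, since the question does not say what should happen in that case.

```r
# Rebuild the example data from the question.
id <- c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3)
col1 <- c(0,0,1,1,0,0,1,0,0,1,1,0,1,0,1,0)
col2 <- c("A","B","A","A","B","A","A","B","A","A","B","A","A","B","A","B")
my_data <- data.frame(id, col1, col2)
my_data$row_num <- seq_len(nrow(my_data))

# keep_until_hit() is a hypothetical helper: it keeps the rows of one group
# up to and including the first row where the condition holds.
keep_until_hit <- function(group) {
  hit <- which(group$col1 == 1 | group$col2 == "A")
  # Assumption: if the condition never holds, keep the whole group.
  last_row <- if (length(hit) > 0) hit[1] else nrow(group)
  group[seq_len(last_row), , drop = FALSE]
}

# Split by id, trim each piece, then stack the pieces back together.
pieces <- lapply(split(my_data, my_data$id), keep_until_hit)
filtered_data <- do.call(rbind, pieces)
rownames(filtered_data) <- NULL

filtered_data$row_num
# row_num of the kept rows: 1 5 6 10
```

Putting the per-group logic in one small function keeps the "first hit" rule in a single place, which makes it easier to change the condition later.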

CodePudding user response：

You could keep every row whose row_number() is at or before the first row where the condition is met, using which() like this:

id = c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3)
col1 = c(0,0,1,1,0,0,1,0,0,1,1,0,1,0,1,0)
col2 = c("A", "B", "A","A", "B", "A","A", "B", "A","A", "B", "A","A", "B", "A", "B")
my_data = data.frame(id, col1, col2)
my_data$row_num = 1:nrow(my_data)

library(dplyr)
my_data %>%
  group_by(id) %>%
  filter(row_number() <= min(which((col1 == 1) | (col2 == "A"))))
#> # A tibble: 4 × 4
#> # Groups:   id [3]
#>      id  col1 col2  row_num
#>   <dbl> <dbl> <chr>   <int>
#> 1     1     0 A           1
#> 2     2     0 B           5
#> 3     2     0 A           6
#> 4     3     1 A          10

Created on 2023-01-20 with reprex v2.0.2
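
One thing to watch with the min(which(...)) approach: if a group never meets the condition, which() returns an empty vector and min() warns and returns Inf. A variant that avoids this, as a sketch under assumptions, is to drop every row that comes after an earlier hit. The id 4 group below is hypothetical data added only to show a group with no match, and keeping such a group in full is my assumption.

```r
library(dplyr)

# Example data from the question, plus a hypothetical id 4
# in which the condition is never met.
id <- c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3,4,4)
col1 <- c(0,0,1,1,0,0,1,0,0,1,1,0,1,0,1,0,0,0)
col2 <- c("A","B","A","A","B","A","A","B","A","A","B","A","A","B","A","B","B","B")
my_data <- data.frame(id, col1, col2)
my_data$row_num <- seq_len(nrow(my_data))

result <- my_data %>%
  group_by(id) %>%
  # lag() shifts the condition down by one row, so the running count stays 0
  # up to and including the first hit and becomes positive only afterwards.
  filter(cumsum(dplyr::lag(col1 == 1 | col2 == "A", default = FALSE)) == 0) %>%
  ungroup()

result$row_num
# row_num of the kept rows: 1 5 6 10 17 18
```

For ids 1 to 3 this keeps the same rows as the filter above; the only difference is that a group without a match is kept in full and no warning is raised.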