R: Removing All Rows After Condition Is Met

Time:01-21

I am working with the R programming language.

I have the following dataset:

id = c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3)
col1 = c(0,0,1,1,0,0,1,0,0,1,1,0,1,0,1,0)
col2 = c("A", "B", "A","A", "B", "A","A", "B", "A","A", "B", "A","A", "B", "A", "B")
my_data = data.frame(id, col1, col2)
my_data$row_num = 1:nrow(my_data)

For each unique id, I want to find the first row where col1 = 1 OR col2 = "A", keep that row, and delete all rows that come AFTER it within the same id (i.e. keep everything up to and including the first occurrence).

I found this question (How to Filter out Rows per Group after Condition Occurrs), in which an answer to a similar question is provided. I tried to adapt that answer to my problem:

library(dplyr) 

my_data %>%
    group_by(id) %>% 
    slice(seq_len(which((col1 == 1) | (col2 == "A"))[1]))

Can someone please confirm if I have done this correctly? I am not sure if I have correctly inserted the OR statement within the "slice" function.

Thanks!

CodePudding user response:

Not sure if there's some tidy magic that can do the job, but here's the "dumb" approach: partition the dataset by ID and loop through each of the parts:

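# Start with an empty data frame that has the same columns as my_data
filtered_data <- data.frame(matrix(NA, nrow=0, ncol=4))
colnames(filtered_data) <- colnames(my_data)

rows_added <- 0
for(id in unique(my_data$id)) {
  relevant_data <- my_data[my_data$id == id,]
  # Loop over the rows of this group only, not of the whole dataset
  for(row in seq_len(nrow(relevant_data))) {
      rows_added <- rows_added + 1
      filtered_data[rows_added,] <- relevant_data[row,]
      # Stop copying once the first row that meets the condition has been added
      jump_condition <- relevant_data[row, "col1"] == 1 | relevant_data[row, "col2"] == "A"
      if(jump_condition) {
        break
      }
  }
}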
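
For comparison, the same "keep everything up to and including the first hit" logic can probably be written without an explicit loop. The following is only a sketch in base R, using ave() to compute a per-id running count; the names hit, hits_before and filtered_vec are made up for this example and are not from the question.

```r
# Same example data as in the question
id   <- c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3)
col1 <- c(0,0,1,1,0,0,1,0,0,1,1,0,1,0,1,0)
col2 <- c("A","B","A","A","B","A","A","B","A","A","B","A","A","B","A","B")
my_data <- data.frame(id, col1, col2)
my_data$row_num <- seq_len(nrow(my_data))

# TRUE on every row where the stop condition holds
hit <- my_data$col1 == 1 | my_data$col2 == "A"

# Per id: number of hits that occurred strictly before the current row
hits_before <- ave(as.integer(hit), my_data$id,
                   FUN = function(x) cumsum(x) - x)

# Keep rows up to and including the first hit in each group
filtered_vec <- my_data[hits_before == 0, ]
filtered_vec
```

A group in which the condition never holds has hits_before equal to 0 on every row, so all of its rows are kept.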

CodePudding user response:

You could keep every row up to and including the first row where the condition holds, by comparing row_number() with the first index returned by which(), like this:

id = c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3)
col1 = c(0,0,1,1,0,0,1,0,0,1,1,0,1,0,1,0)
col2 = c("A", "B", "A","A", "B", "A","A", "B", "A","A", "B", "A","A", "B", "A", "B")
my_data = data.frame(id, col1, col2)
my_data$row_num = 1:nrow(my_data)

library(dplyr)
my_data %>%
  group_by(id) %>%
  filter(row_number() <= min(which((col1 == 1) | (col2 == "A"))))
#> # A tibble: 4 × 4
#> # Groups:   id [3]
#>      id  col1 col2  row_num
#>   <dbl> <dbl> <chr>   <int>
#> 1     1     0 A           1
#> 2     2     0 B           5
#> 3     2     0 A           6
#> 4     3     1 A          10

Created on 2023-01-20 with reprex v2.0.2
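
One edge case worth checking: if some id never meets the condition, which() returns an empty vector. In that case min() warns and returns Inf (so all rows of that group are kept), while the slice(seq_len(which(...)[1])) form from the question would fail, because which(...)[1] is NA. A variant that avoids both is sketched below with cumsum() and lag(); the toy data frame is hypothetical and only there to show a group without a hit.

```r
library(dplyr)

# Hypothetical data: id 2 never meets the condition
toy <- data.frame(
  id   = c(1, 1, 1, 2, 2),
  col1 = c(0, 1, 0, 0, 0),
  col2 = c("B", "B", "A", "B", "B")
)

kept <- toy %>%
  group_by(id) %>%
  # cumsum() counts the hits so far; lag() shifts that count down one row,
  # so the first hit itself still sees 0 and is kept
  filter(dplyr::lag(cumsum(col1 == 1 | col2 == "A"), default = 0L) == 0) %>%
  ungroup()

kept
```

Here id 1 keeps its first two rows (the second row is the first hit), and id 2 keeps both rows without any warning.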
