Filter row by first and last condition in R-CodePudding

I would like to remove all rows before a specific factor (in this case the first Running in Run column) and all rows after the last specific case (in this case also Running in Run Column).

data <- tribble( ~Date, ~Time, ~Run, "17/04/12", "00:10:00", "Stop", "17/04/12", "00:19:00", "Stop", "17/04/12", "00:25:00", "Running", "17/04/12", "00:29:00", "Running", "17/04/12", "00:25:00", "Stop", "17/04/12", "00:29:00", "Running", "17/04/12", "00:30:00", "Running", "18/04/12", "00:10:00", "Stop", "18/04/12", "00:14:00", "Running", "18/04/12", "00:20:00", "Stop" "18/04/12", "00:24:00", "Stop" "18/04/12", "00:26:00", "Stop" )

Date	Time	Run
17/04/12	00:10:00
17/04/12	00:19:00
17/04/12	00:25:00	Running
17/04/12	00:29:00	Running
17/04/12	00:25:00	Stop
17/04/12	00:29:00	Running
17/04/12	00:30:00	Running
18/04/12	00:10:00	Stop
18/04/12	00:14:00	Running
18/04/12	00:20:00	Stop
18/04/12	00:24:00	Stop
18/04/12	00:26:00	Stop

Result:

Date	Time	Run
17/04/12	00:25:00	Running
17/04/12	00:29:00	Running
17/04/12	00:25:00	Stop
17/04/12	00:29:00	Running
17/04/12	00:30:00	Running
18/04/12	00:10:00	Stop
18/04/12	00:14:00	Running

CodePudding user response：

Here's a solution using dplyr

library(dplyr)

df %>% 
  filter(Run %in% c("Running", "Stop")) %>% 
  slice(1:max(which(Run == "Running")))

This gives us:

# A tibble: 7 x 3
  Date     Time     Run    
  <chr>    <chr>    <chr>  
1 17/04/12 00:25:00 Running
2 17/04/12 00:29:00 Running
3 17/04/12 00:25:00 Stop   
4 17/04/12 00:29:00 Running
5 17/04/12 00:30:00 Running
6 18/04/12 00:10:00 Stop   
7 18/04/12 00:14:00 Running

CodePudding user response：

Along your data analysis journey, you will face this problem often. If you want to or will have to add more logical checks, you can introduce variables on which you filter. In your case we can simply count the occurrences of Running and Stop. We then use the additional columns to filter your data frame.

library(dplyr)

data %>% 
 mutate(  RunCounts = cumsum(Run == "Running")
       , StopCounts = cumsum(Run == "Stop")
       ) %>% 
 filter(RunCounts >= 1 & StopCounts < max(StopCounts))

This yields:

# A tibble: 7 x 5
  Date     Time     Run     RunCounts StopCounts
  <chr>    <chr>    <chr>       <int>      <int>
1 17/04/12 00:25:00 Running         1          2
2 17/04/12 00:29:00 Running         2          2
3 17/04/12 00:25:00 Stop            2          3
4 17/04/12 00:29:00 Running         3          3
5 17/04/12 00:30:00 Running         4          3
6 18/04/12 00:10:00 Stop            4          4
7 18/04/12 00:14:00 Running         5          4

You can remove the introduced columns by selecting the ones you are interested in:

data %>% mutate(RunCounts = cumsum(Run == "Running"), StopCounts = cumsum(Run == "Stop")) %>% filter(RunCounts >= 1 & StopCounts < max(StopCounts)) %>% select(Date, Time, Run)
# A tibble: 7 x 3
  Date     Time     Run    
  <chr>    <chr>    <chr>  
1 17/04/12 00:25:00 Running
2 17/04/12 00:29:00 Running
3 17/04/12 00:25:00 Stop   
4 17/04/12 00:29:00 Running
5 17/04/12 00:30:00 Running
6 18/04/12 00:10:00 Stop   
7 18/04/12 00:14:00 Running