Home > Back-end >  Filter rows per group with a specific condition
Filter rows per group with a specific condition

Time:03-23

I have a grouped data frame (meaning that I have a list of patients and each patient has several treatments). I now want to remove all rows, that do not meet a specific criteria:

The last column contains yes and no. I only want the rows that contains the last no and the first yes.

PATIENT.ID   Caffeinefactor    
---------------------------
21       no      
21       no      
21       no
21       yes      
21       yes      
34       no      
34       no      
34       yes      
34       yes      
16       no      
16       no      
DF = structure(list(PATIENT.ID= c(21, 21, 21, 21, 21, 34, 34, 34, 34, 
16, 16), Caffeinefactor  = c("no", "no", "yes", "yes", "yes", "no", "no", "yes", "yes", 
"no", "no"),.Names = c("PATIENT.ID", "Caffeinefactor"), class = "data.frame")

Expected Output:

PATIENT.ID   Caffeinefactor    
---------------------------    
21       no
21       yes                
34       no      
34       yes           

I started off with this code:

daf1 <- df %>% 
  group_by(PATIENT.ID) %>% 
  mutate(firstc = (Caffeinefactor == 'yes' & lag(Caffeinefactor) == 'no')) %>% 
  # Find the first row with `double_B` in each group, filter out rows after it
  filter(row_number() <= min(which(firstc == TRUE)))   

however, I cannot seem to solve the problem, to delete everything before AND after the two needed rows.

CodePudding user response:

library(data.table)

# dummy data              
df <- data.table(patient = c(21, 21, 21, 21, 21, 34, 34, 34, 34, 16, 16)
                 , caffiene = c("no", "no", "yes", "yes", "yes", "no", "no", "yes", "yes", "no", "no")
                 )
df[, time := 1:.N][]

    patient caffiene time
 1:      21       no    1
 2:      21       no    2
 3:      21      yes    3
 4:      21      yes    4
 5:      21      yes    5
 6:      34       no    6
 7:      34       no    7
 8:      34      yes    8
 9:      34      yes    9
10:      16       no   10
11:      16       no   11

# identify 1st yes
df[caffiene == 'yes', nth_yes := 1:.N, patient]

# fetch row index of 1st 'yes' and the 'no' immediately preceeding
x <- df[, .I[nth_yes == 1 & !is.na(nth_yes)] ]
y <- x-1

# return corresponding rows
df[c(y,x)][, nth_yes := NULL][order(patient)]

   patient caffiene time
1:      21       no    2
2:      21      yes    3
3:      34       no    7
4:      34      yes    8

CodePudding user response:

In case there are other columns, you can try the following:

library(dplyr)
df %>% 
    # add row id only to confirm if the correct rows were selected
    mutate(row_id = row_number()) %>% 
    group_by(PATIENT.ID) %>% 
    mutate(first_yes = cumsum(Caffeinefactor == "yes"), 
           last_no = (Caffeinefactor == "no") * row_number()) %>% 
    filter(first_yes == 1 | last_no == max(last_no)) %>%
    select(-first_yes, -last_no)
#> # A tibble: 5 x 3
#> # Groups:   PATIENT.ID [3]
#>   PATIENT.ID Caffeinefactor row_id
#>        <dbl> <chr>           <int>
#> 1         21 no                  2
#> 2         21 yes                 3
#> 3         34 no                  7
#> 4         34 yes                 8
#> 5         16 no                 11

Input:

df <- data.frame(
    PATIENT.ID= c(21, 21, 21, 21, 21, 34, 34, 34, 34, 16, 16), 
    Caffeinefactor = c("no", "no", "yes", "yes", "yes", "no", "no", "yes", "yes", "no", "no")
)
  •  Tags:  
  • r
  • Related