I have a grouped data frame (meaning that I have a list of patients and each patient has several treatments). I now want to remove all rows, that do not meet a specific criteria:
The last column contains yes and no. I only want the rows that contains the last no and the first yes.
PATIENT.ID Caffeinefactor
---------------------------
21 no
21 no
21 no
21 yes
21 yes
34 no
34 no
34 yes
34 yes
16 no
16 no
DF = structure(list(PATIENT.ID= c(21, 21, 21, 21, 21, 34, 34, 34, 34,
16, 16), Caffeinefactor = c("no", "no", "yes", "yes", "yes", "no", "no", "yes", "yes",
"no", "no"),.Names = c("PATIENT.ID", "Caffeinefactor"), class = "data.frame")
Expected Output:
PATIENT.ID Caffeinefactor
---------------------------
21 no
21 yes
34 no
34 yes
I started off with this code:
daf1 <- df %>%
group_by(PATIENT.ID) %>%
mutate(firstc = (Caffeinefactor == 'yes' & lag(Caffeinefactor) == 'no')) %>%
# Find the first row with `double_B` in each group, filter out rows after it
filter(row_number() <= min(which(firstc == TRUE)))
however, I cannot seem to solve the problem, to delete everything before AND after the two needed rows.
CodePudding user response:
library(data.table)
# dummy data
df <- data.table(patient = c(21, 21, 21, 21, 21, 34, 34, 34, 34, 16, 16)
, caffiene = c("no", "no", "yes", "yes", "yes", "no", "no", "yes", "yes", "no", "no")
)
df[, time := 1:.N][]
patient caffiene time
1: 21 no 1
2: 21 no 2
3: 21 yes 3
4: 21 yes 4
5: 21 yes 5
6: 34 no 6
7: 34 no 7
8: 34 yes 8
9: 34 yes 9
10: 16 no 10
11: 16 no 11
# identify 1st yes
df[caffiene == 'yes', nth_yes := 1:.N, patient]
# fetch row index of 1st 'yes' and the 'no' immediately preceeding
x <- df[, .I[nth_yes == 1 & !is.na(nth_yes)] ]
y <- x-1
# return corresponding rows
df[c(y,x)][, nth_yes := NULL][order(patient)]
patient caffiene time
1: 21 no 2
2: 21 yes 3
3: 34 no 7
4: 34 yes 8
CodePudding user response:
In case there are other columns, you can try the following:
library(dplyr)
df %>%
# add row id only to confirm if the correct rows were selected
mutate(row_id = row_number()) %>%
group_by(PATIENT.ID) %>%
mutate(first_yes = cumsum(Caffeinefactor == "yes"),
last_no = (Caffeinefactor == "no") * row_number()) %>%
filter(first_yes == 1 | last_no == max(last_no)) %>%
select(-first_yes, -last_no)
#> # A tibble: 5 x 3
#> # Groups: PATIENT.ID [3]
#> PATIENT.ID Caffeinefactor row_id
#> <dbl> <chr> <int>
#> 1 21 no 2
#> 2 21 yes 3
#> 3 34 no 7
#> 4 34 yes 8
#> 5 16 no 11
Input:
df <- data.frame(
PATIENT.ID= c(21, 21, 21, 21, 21, 34, 34, 34, 34, 16, 16),
Caffeinefactor = c("no", "no", "yes", "yes", "yes", "no", "no", "yes", "yes", "no", "no")
)