I have a dataset formatted as follows:
person_ID exam_ID value_1 number_studies
A1 1A1 2 3
A1 2A1 3 3
A1 3A1 1 3
A2 1A2 2 5
A2 2A2 3 5
A2 3A2 3.5 5
A2 4A2 1.5 5
A2 5A2 1.0 5
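To make the example reproducible, the table above can be rebuilt as a data frame. This is only a minimal sketch: the name df matches the code further down, and the column types (character IDs, numeric values) are assumptions rather than something stated in the post.

```r
# Rebuild the example table; column types are assumed, not taken from the real data
df <- data.frame(
  person_ID      = c("A1", "A1", "A1", "A2", "A2", "A2", "A2", "A2"),
  exam_ID        = c("1A1", "2A1", "3A1", "1A2", "2A2", "3A2", "4A2", "5A2"),
  value_1        = c(2, 3, 1, 2, 3, 3.5, 1.5, 1.0),
  number_studies = c(3, 3, 3, 5, 5, 5, 5, 5),
  stringsAsFactors = FALSE
)

str(df)  # quick check of the column types
```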
The data is ordered by person_ID and then by exam_ID. Within each person_ID, I would like to remove the first row whose value_1 differs from the previous row's value_1 by less than -1 (i.e. drops by more than 1), together with every row after it.
For example, for person_ID 'A1', I would keep exam_IDs '1A1' and '2A1', but remove '3A1', because value_1 changes by 1 - 3 = -2 from '2A1' to '3A1', which is < -1. For person_ID 'A2', I would remove exam_IDs '4A2' and '5A2', because value_1 changes by 1.5 - 3.5 = -2 from '3A2' to '4A2'.
My plan was to use nested while loops to build a list of exam_IDs and then subset my data frame, but the code does not work. See the example below. I would appreciate any advice or suggestions!
z1 <- list()
for (person in unique(df$person_ID)) {
  tempdata <- subset(df, df$person_ID == person)
  t1 <- seq(from = 1, to = (unique(tempdata$number_studies) - 1))
  i <- 0
  t <- 1
  while (t < (unique(tempdata$number_studies) - 1)) {
    while (i > -1) {
      i <- tempdata[t + 1, 3] - tempdata[t, 3]
      tempID <- tempdata[t, ]
      z1 <- append(z1, tempID$exam_ID)
      t <- t + 1
    }
  }
}
CodePudding user response:
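You don't need a loop for this. Here's a solution using data.table:
library(data.table)
setDT(dat)  # dat is the data frame from the question (df there), converted by reference
# Per person, count the rows whose value_1 fell by more than 1, then keep rows where the count is still 0
dat[, drop := cumsum(c(0, diff(value_1)) < -1), by = person_ID][drop == 0, !"drop"]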
   person_ID exam_ID value_1 number_studies
1:        A1     1A1     2.0              3
2:        A1     2A1     3.0              3
3:        A2     1A2     2.0              5
4:        A2     2A2     3.0              5
5:        A2     3A2     3.5              5
To understand how it works: a variable called drop is created that, within each person_ID, keeps a running count of the rows whose value_1 is more than 1 below the previous row's value (a difference of less than -1). The count stays at 0 until the first such row and is at least 1 from that row onward. Only the rows where drop is 0 are then returned, and the drop column itself is left out of the returned result.
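The intermediate steps can be seen by running the pieces on one person's values. The sketch below uses the value_1 values for 'A2' from the question, then applies the same idea to the whole table in base R with ave(). The names step, flag, drop_all and kept are illustrative only, and the base R version is an alternative formulation, not part of the data.table solution above.

```r
# value_1 for person A2, copied from the question
value_1 <- c(2, 3, 3.5, 1.5, 1.0)

step <- c(0, diff(value_1))  # change from the previous row; 0 for the first row
flag <- step < -1            # TRUE where the value fell by more than 1
drop <- cumsum(flag)         # 0 until the first flagged row, at least 1 afterwards

print(data.frame(value_1, step, flag, drop))

# Same filter on the full example table in base R.
# dat is rebuilt here so the block is self-contained.
dat <- data.frame(
  person_ID = c("A1", "A1", "A1", "A2", "A2", "A2", "A2", "A2"),
  exam_ID   = c("1A1", "2A1", "3A1", "1A2", "2A2", "3A2", "4A2", "5A2"),
  value_1   = c(2, 3, 1, 2, 3, 3.5, 1.5, 1.0),
  stringsAsFactors = FALSE
)

# ave() applies the function within each person_ID and returns a vector aligned with the rows
drop_all <- ave(dat$value_1, dat$person_ID,
                FUN = function(v) cumsum(c(0, diff(v)) < -1))

kept <- dat[drop_all == 0, ]
print(kept)
```

Because the running count never goes back to 0, a later recovery in value_1 does not bring rows back: everything from the first large drop onward is excluded for that person.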