I have a dataframe with several columns. One column (C3) contains a marker, defined by a run of values between 8 and 10 over 50 or more consecutive rows (the marker in the example below spans only 4 rows). Values in that range can also appear scattered throughout the rest of the dataframe (possibly more than 50 times in total), but only within the marker do they occur in 50 or more consecutive rows.
data <- data.frame(C1 = c(12.3, 5.2, 2.43, 6, 8.1, 1, 3.25, 3.67, 12.87, NA, 2, 4.5),
C2 = c(NA, 7, 2.5, 4.53, 3.5, NA, 9, 9.75, 10.4, NA, 3, NA),
C3 = c(3.41, 5.3, 8.1, 9.5, 9.99, 8.45, 7.2, 6.8, 5.4, 3.2, NA, 4.3),
C4 = c(10.54, 2.3, 8.4, 9.32, 3.65, 5.8, NA, 4.5, 7.6, NA, 6.7, 5.6),
C5 = c(3.45, 6.97, 5, 9.87, 2, 3.4, 5.6, 9.8, NA, NA, 0.3, 11.5))
I would like to start a new dataframe for calculations at the end of the marker, plus 5 more rows, to be really sure the data is "clean" without any marker leftovers. The markers can appear at random places in the column, and I always want the new dataframe to start there. In the end my new dataframe should look like this...
data_new <- data.frame(C1 = c(3.25, 3.67, 9.87, NA, 2, 4.5),
C2 = c(9, 9.75, 10.4, NA, 3, NA),
C3 = c(7.2, 6.8, 5.4, 3.2, NA, 4.3),
C4 = c(NA, 4.5, 7.6, NA, 6.7, 5.6),
C5 = c(5.6, 9.8, NA, NA, 0.3, 11.5))
So the beginning of the old dataframe, including the marker, is cut away. Thanks in advance!
CodePudding user response:
A possible solution:
nmarkers <- 4
# flag rows where C3 lies strictly between 8 and 10 (NAs count as non-markers)
marker <- ifelse(is.na(data$C3), 0, data$C3 > 8 & data$C3 < 10)
# keep rows once nmarkers marker values have accumulated, excluding the marker rows themselves
data[cumsum(marker) >= nmarkers & marker == 0, ]
C1 C2 C3 C4 C5
7 3.25 9.00 7.2 NA 5.6
8 3.67 9.75 6.8 4.5 9.8
9 9.87 10.40 5.4 7.6 NA
10 NA NA 3.2 NA NA
11 2.00 3.00 NA 6.7 0.3
12 4.50 NA 4.3 5.6 NA
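Note that cumsum(marker) counts values in the 8-10 range wherever they occur in C3, so if the scattered occurrences mentioned in the question can show up before the actual marker block, the cut could happen too early. A possible sketch (my own addition, only checked against this small example) that requires the values to be consecutive uses rle(); run_len stands in for the 50 consecutive rows of the real data (4 here) and buffer for the 5 extra rows to drop:
run_len <- 4
buffer  <- 0   # set to 5 for the real data
is_mark <- !is.na(data$C3) & data$C3 > 8 & data$C3 < 10
r <- rle(is_mark)
# first run of TRUEs that is at least run_len long (NA if no such run exists)
run_id <- which(r$values & r$lengths >= run_len)[1]
marker_end <- cumsum(r$lengths)[run_id]   # last row of that run
data_new <- data[(marker_end + 1 + buffer):nrow(data), ]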
CodePudding user response:
I'm not sure if I understood the question correctly, but I think you are looking for a general solution that can search any column for a run of 50 unbroken marker values and slice at the end of that run plus a 5-row buffer. I thought a sliding-window approach could work as a function, even if it is perhaps not the most computationally efficient. Re-reading the question, I see it says "50 or more", and I think this will only work for exactly the number of markers specified, so maybe you can use it as a starting point and make it more flexible.
data <- data.frame(C1 = c(12.3, 5.2, 2.43, 6, 8.1, 1, 3.25, 3.67, 9.87, NA, 2, 4.5),
C2 = c(NA, 7, 2.5, 4.53, 3.5, NA, 9, 9.75, 10.4, NA, 3, NA),
C3 = c(3.41, 5.3, 8.1, 9.5, 9.99, 8.45, 7.2, 6.8, 5.4, 3.2, NA, 4.3),
C4 = c(8.54, 2.3, 8.4, 9.32, 3.65, 5.8, NA, 4.5, 7.6, NA, 6.7, 5.6),
C5 = c(3.45, 6.97, 5, 9.87, 2, 3.4, 5.6, 9.8, NA, NA, 0.3, NA))
df.slice <- function(data, upper_mark_value, lower_mark_value, chunk.size, buffer){
  # 0/1 matrix: 1 where a value lies strictly between the two mark values (NA -> 0)
  marks <- apply(data, c(1, 2), FUN = function(x)
    ifelse(x > lower_mark_value & x < upper_mark_value & is.na(x) == FALSE, 1, 0))
  cs <- chunk.size
  start.row <- 1
  stop.row <- (start.row + cs) - 1
  # slide a window of chunk.size rows down the data until some column
  # consists entirely of marker values (its column total equals chunk.size)
  repeat{
    matrix.chunk <- marks[start.row:stop.row, ]
    totals <- apply(matrix.chunk, 2, sum)
    if(length(which(totals == cs)) > 0){break}
    start.row <- start.row + 1
    stop.row <- stop.row + 1
    if(stop.row > nrow(marks)){break}
  }
  buf <- buffer
  # return everything after the end of the window plus the buffer rows
  return(data[((stop.row + 1) + buf):nrow(data), ])
}
df.slice(data,10,8,4,0)
C1 C2 C3 C4 C5
7 3.25 9.00 7.2 NA 5.6
8 3.67 9.75 6.8 4.5 9.8
9 9.87 10.40 5.4 7.6 NA
10 NA NA 3.2 NA NA
11 2.00 3.00 NA 6.7 0.3
12 4.50 NA 4.3 5.6 NA
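A quick usage note (my addition, not part of the answer above): the call shown uses a buffer of 0; with the 5-row buffer from the question it would be
df.slice(data, 10, 8, 4, 5)   # on this small example only row 12 remains
since the buffer consumes most of the remaining rows. On the real data the call would presumably look like df.slice(data, 10, 8, 50, 5).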