I have a large list of several hundreds of data frames and trying to filter rows from between two values containing a pattern VALUE1 and VALUE2 in the column Z. Like this:
weight | height | Z
---------------------------
62 100 NA
65 89 NA
59 88 randomnumbersVALUE1randomtext
66 92 NA
64 90 NA
64 87 randomnumbersVALUE2randomtext
57 84 NA
68 99 NA
59 82 NA
60 87 srebmunmodnarVALUE1txetmodnar
61 86 NA
63 84 srebmunmodnarVALUE2txetmodnar
And after filtering I would get:
59 88 randomnumbersVALUE1randomtext
66 92 NA
64 90 NA
64 87 randomnumbersVALUE2randomtext
60 87 srebmunmodnarVALUE1txetmodnar
61 86 NA
63 84 srebmunmodnarVALUE2txetmodnar
The code I'm using is:
lapply(df, function(x){
start <- which(grepl("VALUE1", x$Z))
end <- which(grepl("VALUE2", x$Z))
rows <- unlist(lapply(seq_along(start), function(y){start[y]:end[y]}))
return(df[rows,])})
But whenever I try to run the script, I get an error message saying:
Error in df[rows, ] : incorrect number of dimensions
Why does this happen and how can I get around it..?
EDIT: Added a minimal sample data of the actual datasheet (the first data frame and first element of the list, VALUE2 will follow VALUE 1 always at some point)
> head(tbl[[1]])
# A tibble: 6 × 4
t speed off Z
<dbl> <dbl> <dbl> <chr>
1 27.3 27.8 0.485 "{\"type\":\"M\",\"msg\":\"VALUE1\",\"time\":27.2498,\"dist\":0.410454}"
2 27.4 27.8 0.457 NA
3 27.5 27.8 0.430 NA
4 27.6 27.8 0.402 NA
5 27.7 27.8 0.374 NA
6 27.8 27.8 0.347 NA
CodePudding user response:
Assuming there are equal number of 'VALUE1', 'VALUE2', get the position index of 'VALUE1', 'VALUE2', separately with grep
, create a sequence (:
) by looping over the corresponding positions in Map
, unlist
and use the sequence to subset the data
df1[sort(unique(unlist(Map(`:`, grep("VALUE1", df1$Z),
grep("VALUE2", df1$Z))))),]
-output
weight height Z
3 59 88 randomnumbersVALUE1randomtext
4 66 92 <NA>
5 64 90 <NA>
6 64 87 randomnumbersVALUE2randomtext
10 60 87 srebmunmodnarVALUE1txetmodnar
11 61 86 <NA>
12 63 84 srebmunmodnarVALUE2txetmodnar
If the df
is a single data.frame, when we loop over the data.frame
with lapply
, it will be looping over the columns and thus each list
element is a vector. Therefore, there is no x$Z
. Each x
will be the corresponding column
If it is a list
, then the error can occur when there are cases with no 'VALUE1' or 'VALUE2' or if the number of 'VALUE1' matches are not equal to 'VALUE2'. It may be better to check those elements before doing the :
data
df1 <- structure(list(weight = c(62L, 65L, 59L, 66L, 64L, 64L, 57L,
68L, 59L, 60L, 61L, 63L), height = c(100L, 89L, 88L, 92L, 90L,
87L, 84L, 99L, 82L, 87L, 86L, 84L), Z = c(NA, NA,
"randomnumbersVALUE1randomtext",
NA, NA, "randomnumbersVALUE2randomtext", NA, NA, NA,
"srebmunmodnarVALUE1txetmodnar",
NA, "srebmunmodnarVALUE2txetmodnar")),
class = "data.frame", row.names = c(NA,
-12L))