lapply: "incorrect number of dimensions" in a list of data frames?

Time:03-30

I have a large list of several hundred data frames, and I am trying to keep only the rows that lie between two markers: from a row whose column Z contains the pattern VALUE1 through the next row whose Z contains VALUE2. Like this:

weight | height | Z
---------------------------
62      100      NA
65      89       NA
59      88       randomnumbersVALUE1randomtext
66      92       NA
64      90       NA
64      87       randomnumbersVALUE2randomtext
57      84       NA
68      99       NA
59      82       NA
60      87       srebmunmodnarVALUE1txetmodnar
61      86       NA
63      84       srebmunmodnarVALUE2txetmodnar

And after filtering I would get:

59      88       randomnumbersVALUE1randomtext
66      92       NA
64      90       NA
64      87       randomnumbersVALUE2randomtext
60      87       srebmunmodnarVALUE1txetmodnar
61      86       NA
63      84       srebmunmodnarVALUE2txetmodnar

The code I'm using is:

lapply(df, function(x){
  start <- which(grepl("VALUE1", x$Z))
  end   <- which(grepl("VALUE2", x$Z))
  rows  <- unlist(lapply(seq_along(start), function(y){start[y]:end[y]}))
  return(df[rows,])})

But whenever I try to run the script, I get an error message saying:

Error in df[rows, ] : incorrect number of dimensions

Why does this happen, and how can I get around it?

EDIT: Added a minimal sample of the actual data (the first data frame, i.e. the first element of the list; VALUE2 will always follow VALUE1 at some point)

> head(tbl[[1]])
# A tibble: 6 × 4
      t speed   off Z                                                                       
  <dbl> <dbl> <dbl> <chr>                                                                   
1  27.3  27.8 0.485 "{\"type\":\"M\",\"msg\":\"VALUE1\",\"time\":27.2498,\"dist\":0.410454}"
2  27.4  27.8 0.457  NA                                                                     
3  27.5  27.8 0.430  NA                                                                     
4  27.6  27.8 0.402  NA                                                                     
5  27.7  27.8 0.374  NA                                                                     
6  27.8  27.8 0.347  NA    

Answer:

Assuming there are equal numbers of 'VALUE1' and 'VALUE2' matches, get the position indices of each separately with grep, build a sequence (:) for each corresponding pair of positions with Map, unlist the result, and use that index vector to subset the data:

df1[sort(unique(unlist(Map(`:`, grep("VALUE1", df1$Z),
       grep("VALUE2", df1$Z))))),]

-output

 weight height                             Z
3      59     88 randomnumbersVALUE1randomtext
4      66     92                          <NA>
5      64     90                          <NA>
6      64     87 randomnumbersVALUE2randomtext
10     60     87 srebmunmodnarVALUE1txetmodnar
11     61     86                          <NA>
12     63     84 srebmunmodnarVALUE2txetmodnar

If df is a single data.frame, looping over it with lapply iterates over its columns, so each element passed to the function is a vector. Each x is then one column, and there is no x$Z.
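A quick way to see this, as a minimal sketch with a made-up two-column data frame (`d` is a hypothetical name, not from the question):

```r
# Hypothetical toy data frame, used only for illustration
d <- data.frame(weight = c(62L, 65L), Z = c(NA, "xVALUE1y"))

# lapply() on a data.frame visits each column, not the frame as a whole
classes <- lapply(d, class)
print(classes)

# Each element handed to the function is a plain vector, so it has no columns
is_vec <- vapply(d, is.atomic, logical(1))
print(is_vec)
```

The names of the result are the column names, which shows that the function was called once per column.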

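If it is a list of data frames, note that the function returns df[rows, ], which subsets the list itself rather than the current element x. A list has no dimensions, so two-index subsetting fails with "incorrect number of dimensions"; returning x[rows, ] avoids that. The : step can also fail when an element has no 'VALUE1' or 'VALUE2' match, or when the number of 'VALUE1' matches differs from the number of 'VALUE2' matches, so it may be better to check those elements before building the sequences.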
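One way to apply this across a list is sketched below, assuming each list element is a data frame with a column Z. The helper name `filter_between` and the toy list `dfs` are made up for illustration. The guard simply returns zero rows for any element whose markers cannot be paired one to one, which is one possible policy rather than the only reasonable one.

```r
# Hypothetical helper: keep rows from each VALUE1 match through its VALUE2 match
filter_between <- function(x, from = "VALUE1", to = "VALUE2") {
  start <- grep(from, x$Z)
  end   <- grep(to, x$Z)
  # Guard: markers must pair up one to one, with each end at or after its start
  if (length(start) == 0L || length(start) != length(end) || any(end < start)) {
    return(x[0, , drop = FALSE])   # zero-row frame with the same columns
  }
  rows <- unlist(Map(`:`, start, end))
  x[rows, , drop = FALSE]          # subset x, the current data frame, not the list
}

# Toy list for illustration (not the asker's data)
dfs <- list(
  a = data.frame(w = 1:5, Z = c(NA, "aVALUE1b", NA, "aVALUE2b", NA)),
  b = data.frame(w = 1:3, Z = c(NA, "aVALUE1b", NA))   # no closing VALUE2
)

res <- lapply(dfs, filter_between)
print(res)
```

Here lapply receives a list, so each x is a whole data frame and x$Z exists. Element b has an unmatched VALUE1 and comes back empty instead of raising an error.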

data

df1 <- structure(list(weight = c(62L, 65L, 59L, 66L, 64L, 64L, 57L, 
68L, 59L, 60L, 61L, 63L), height = c(100L, 89L, 88L, 92L, 90L, 
87L, 84L, 99L, 82L, 87L, 86L, 84L), Z = c(NA, NA,
 "randomnumbersVALUE1randomtext", 
NA, NA, "randomnumbersVALUE2randomtext", NA, NA, NA, 
"srebmunmodnarVALUE1txetmodnar", 
NA, "srebmunmodnarVALUE2txetmodnar")), 
class = "data.frame", row.names = c(NA, 
-12L))