R Error: Error in input[i, ] : incorrect number of dimensions

I am working with the R programming language. I have the following dataset: students take an exam multiple times, and each attempt is either a pass ("1") or a fail ("0"). The data looks something like this:

id = sample.int(10000, 100000, replace = TRUE)
res = c(1,0)
results = sample(res, 100000, replace = TRUE)
date_exam_taken = sample(seq(as.Date('1999/01/01'), as.Date('2020/01/01'), by="day"), 100000, replace = TRUE)


my_data = data.frame(id, results, date_exam_taken)
my_data <- my_data[order(my_data$id, my_data$date_exam_taken),]

my_data$general_id = 1:nrow(my_data)
my_data$exam_number = ave(my_data$general_id, my_data$id, FUN = seq_along)
my_data$general_id = NULL

      id results date_exam_taken exam_number
63018  1       0      2001-08-15           1
72324  1       1      2002-09-03           2
98866  1       0      2003-01-13           3
56137  1       1      2005-06-15           4
77746  1       0      2007-06-26           5
21438  1       0      2011-09-23           6

I then transformed the data into the following format:

library(tidyr)

# One row per student, one column per exam attempt (1st, 2nd, ...)
my_data = my_data %>% 
  pivot_wider(id_cols = id, names_from = "exam_number", values_from = "results")

# A tibble: 10,000 x 24
      id   `1`   `2`   `3`   `4`   `5`   `6`   `7`   `8`   `9`  `10`  `11`  `12`  `13`  `14`  `15`  `16`  `17`  `18`  `19`  `20`  `21`  `22`  `23`
   <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1     1     0     1     0     1     0     0     0     1     0     1    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 2     2     1     0     1     1     0     0    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 3     3     1     0     1     0     1     1     1     1     0     1     1     1     0     0     0     1     1     1    NA    NA    NA    NA    NA
 4     4     1     1     0     0     0     1     1    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 5     5     1     0     1     0     0     1     0     0     0     0     1    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 6     6     1     1     0     1     1     0     0     1     0     0     1    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 7     7     0     0     1     1     0     1     1     0     1     0    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 8     8     0     1     0     1     0     1     0     1     1    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 9     9     0     0     0     0     0     0     1     1     0     1     0    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
10    10     0     0     1     1     1    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
# ... with 9,990 more rows

Now, suppose I have the following sequences:

my_grid= expand.grid(0:1, 0:1, 0:1)
n = nrow(my_grid)
n = c(1:n)

my_grid$sequence = paste("sequence", n)
my_grid$seq = paste0(my_grid$Var1, my_grid$Var2, my_grid$Var3)

     Var1 Var2 Var3 sequence seq
1    0    0    0 sequence 1  000
2    1    0    0 sequence 2  100
3    0    1    0 sequence 3  010
4    1    1    0 sequence 4  110
5    0    0    1 sequence 5  001
6    1    0    1 sequence 6  101
7    0    1    1 sequence 7  011
8    1    1    1 sequence 8  111

GOAL: Within the entire dataset, I want to count how many times each sequence appears (at the row level, i.e. within each student's results). For example, given that a student in this population failed two consecutive tests (e.g. failed tests 4 & 5, or tests 1 & 2), what is the probability that such a student will also fail the next test?

I tried to approach this problem as follows: I took the exam scores of each student, concatenated them into a single string, and stored this string in a new column. This should make it easier to recognize a desired pattern:

my_list = list()
for (i in seq_len(nrow(my_data)))
{
    val_i = paste(my_data[i,-1], collapse="")
    print(val_i)
    my_list[[i]] = val_i
}

my_data$cols <- my_list

my_fun <- function(seq, data){
    return(lengths(gregexpr(seq, data)))
}

PROBLEM: Then, I tried to apply this function to obtain the final counts - but I am getting this error:

#PROBLEM
my_grid$counts = mapply(my_fun, c(my_grid$seq), my_data$cols)
Error in input[i, ] : incorrect number of dimensions

Ideally, I am looking for the final result to look something like this (from here, I could simply calculate the conditional probabilities):

# FINAL RESULT
  Var1 Var2 Var3   sequence seq counts
1    0    0    0 sequence 1 000    ...
2    1    0    0 sequence 2 100    ...
3    0    1    0 sequence 3 010    ...
4    1    1    0 sequence 4 110    ...
5    0    0    1 sequence 5 001    ...
6    1    0    1 sequence 6 101    ...
7    0    1    1 sequence 7 011    ...
8    1    1    1 sequence 8 111    ...

QUESTION: Can someone please show me what I am doing wrong and what I can do to fix this?

Thanks!

NOTE 1: Instead of using a function, I tried to do this with a for loop.

Here is the code I wrote:

my_list = list()
for (i in 1:length(my_grid$seq))
{
    seq_i = my_grid$seq[i]
    val_i = sum(lengths(gregexpr(seq_i, my_data$cols)))
    print(c(i, seq_i, val_i))
}

[1] "1"     "000"   "11255"
[1] "2"     "100"   "12743"
[1] "3"     "010"   "12145"
[1] "4"     "110"   "12676"
[1] "5"     "001"   "12765"
[1] "6"     "101"   "12085"
[1] "7"     "011"   "12672"
[1] "8"     "111"   "11201"

But I don't think this is correct: the counts look rather high.

NOTE 2: I am also trying to make sure that the conditional probabilities are calculated from each individual student's scores and not by "clumping" all student scores together.

E.g.

student 1 = 1,1,0,0,1,0,0
student 2 = 1,0,0,1,1,1,0

It would be incorrect to combine the scores of both of these students into a single string "1,1,0,0,1,0,0, 1,0,0,1,1,1,0" and then calculate the frequency counts. I would like to calculate these counts at the student level and then add them up.
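
To illustrate the distinction in NOTE 2, here is a minimal sketch using the two example students above. The names `students` and `n_matches` are hypothetical helpers introduced only for this illustration. Counting per student and then adding up avoids matches that straddle the boundary between two students:

```r
# The two example students from NOTE 2, one string per student
students <- c("1100100", "1001110")

# Count matches of a pattern, ignoring the -1 that gregexpr() uses for "no match"
n_matches <- function(pattern, x) {
  sum(vapply(gregexpr(pattern, x, fixed = TRUE),
             function(m) sum(m != -1), integer(1)))
}

# Per-student counts, added up afterwards
per_student <- n_matches("001", students)

# Clumped: both students pasted into one string before counting
clumped <- n_matches("001", paste(students, collapse = ""))
```

Here `per_student` is 2 while `clumped` is 3, because the last two exams of student 1 and the first exam of student 2 form a spurious "001".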

User response:

The issue may be that gregexpr returns -1 for an element with no match. lengths then counts that -1 as 1, which inflates the total when we take the sum. If we change the function to

my_fun <- function(seq, data){
    # drop the -1 "no match" markers before counting
    sum(lengths(lapply(gregexpr(seq, data), function(x) x[x != -1])))
}

Then we use this function as

library(dplyr)
my_grid %>% 
  rowwise %>% 
  mutate(counts = my_fun(seq, my_data$cols)) %>%
  ungroup
# A tibble: 8 × 6
   Var1  Var2  Var3 sequence   seq   counts
  <int> <int> <int> <chr>      <chr>  <int>
1     0     0     0 sequence 1 000     6215
2     1     0     0 sequence 2 100    10018
3     0     1     0 sequence 3 010     8325
4     1     1     0 sequence 4 110     9939
5     0     0     1 sequence 5 001    10072
6     1     0     1 sequence 6 101     8274
7     0     1     1 sequence 7 011     9973
8     1     1     1 sequence 8 111     6097
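
As a hedged aside, the same per-sequence totals can also be sketched in base R without dplyr. The example below runs on made-up strings: `toy_cols`, `count_matches`, and `seqs` are hypothetical names standing in for `my_data$cols`, `my_fun`, and `my_grid$seq`. It also shows how the original `lengths()` call overcounts:

```r
# Toy stand-ins for my_data$cols (hypothetical values, one string per student)
toy_cols <- c("0001", "1111", "0100")

# Same idea as my_fun: ignore the -1 that gregexpr() returns for "no match"
count_matches <- function(seq, data) {
  hits <- gregexpr(seq, data, fixed = TRUE)
  sum(vapply(hits, function(x) sum(x != -1), integer(1)))
}

# Naive version from the question: every non-matching string still adds 1,
# so this is 3 although only one string contains "000"
naive <- sum(lengths(gregexpr("000", toy_cols, fixed = TRUE)))

# One call per sequence, with the full vector of strings passed each time
seqs <- c("000", "111", "010")
counts <- vapply(seqs, count_matches, integer(1), data = toy_cols)
```

Unlike `mapply`, which walks its arguments in parallel (pairing the first sequence with the first string, and so on), `vapply` here iterates over the sequences only and hands every call the complete set of strings.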

Even when we test only the first 6 elements, there are 2 cases (the 2nd and 6th) that have no match, and each returns -1:

> gregexpr(my_grid$seq[1], head(my_data$cols))
[[1]]
[1] 5
attr(,"match.length")
[1] 3
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

[[2]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

[[3]]
[1] 5
attr(,"match.length")
[1] 3
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

[[4]]
[1] 1
attr(,"match.length")
[1] 3
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

[[5]]
[1] 4
attr(,"match.length")
[1] 3
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

[[6]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
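
One further caveat, offered as a sketch under stated assumptions rather than as part of the fix above: gregexpr reports non-overlapping matches, so a string like "0000" counts as one occurrence of "000" even though it contains two overlapping windows. If overlapping windows should count toward the conditional probability, a zero-width lookahead with perl = TRUE is one option. The names `toy_cols` and `count_overlapping` are hypothetical, and the probability estimate assumes that every "00" followed by another exam is one trial:

```r
# Hypothetical per-student strings (one element per student, never pasted together)
toy_cols <- c("0000", "0010", "111")

# Count overlapping occurrences of `seq` within each string, then add them up.
# "(?=...)" is a zero-width lookahead, so a match consumes no characters and
# every starting position is tested.
count_overlapping <- function(seq, data) {
  hits <- gregexpr(paste0("(?=", seq, ")"), data, perl = TRUE)
  sum(vapply(hits, function(x) sum(x != -1), integer(1)))
}

n_000 <- count_overlapping("000", toy_cols)
n_001 <- count_overlapping("001", toy_cols)

# Estimated P(fail next | two consecutive fails) on the toy data
p_fail_after_two_fails <- n_000 / (n_000 + n_001)
```

On the toy data this gives `n_000 = 2`, `n_001 = 1`, and an estimate of 2/3.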