I am working with the R programming language. I have the following dataset - students take an exam multiple times, they either pass ("1") or fail ("0"). The data looks something like this:
id = sample.int(10000, 100000, replace = TRUE)
res = c(1,0)
results = sample(res, 100000, replace = TRUE)
date_exam_taken = sample(seq(as.Date('1999/01/01'), as.Date('2020/01/01'), by="day"), 100000, replace = TRUE)
my_data = data.frame(id, results, date_exam_taken)
my_data <- my_data[order(my_data$id, my_data$date_exam_taken),]
my_data$general_id = 1:nrow(my_data)
my_data$exam_number = ave(my_data$general_id, my_data$id, FUN = seq_along)
my_data$general_id = NULL
id results date_exam_taken exam_number
63018 1 0 2001-08-15 1
72324 1 1 2002-09-03 2
98866 1 0 2003-01-13 3
56137 1 1 2005-06-15 4
77746 1 0 2007-06-26 5
21438 1 0 2011-09-23 6
I then transformed the data into the following format:
library(tidyr)
my_data = my_data %>%
  pivot_wider(id_cols = id, names_from = "exam_number", values_from = "results")
# A tibble: 10,000 x 24
id `1` `2` `3` `4` `5` `6` `7` `8` `9` `10` `11` `12` `13` `14` `15` `16` `17` `18` `19` `20` `21` `22` `23`
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 1 0 1 0 0 0 1 0 1 NA NA NA NA NA NA NA NA NA NA NA NA NA
2 2 1 0 1 1 0 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
3 3 1 0 1 0 1 1 1 1 0 1 1 1 0 0 0 1 1 1 NA NA NA NA NA
4 4 1 1 0 0 0 1 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
5 5 1 0 1 0 0 1 0 0 0 0 1 NA NA NA NA NA NA NA NA NA NA NA NA
6 6 1 1 0 1 1 0 0 1 0 0 1 NA NA NA NA NA NA NA NA NA NA NA NA
7 7 0 0 1 1 0 1 1 0 1 0 NA NA NA NA NA NA NA NA NA NA NA NA NA
8 8 0 1 0 1 0 1 0 1 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
9 9 0 0 0 0 0 0 1 1 0 1 0 NA NA NA NA NA NA NA NA NA NA NA NA
10 10 0 0 1 1 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# ... with 9,990 more rows
Now, suppose I have the following sequences:
my_grid= expand.grid(0:1, 0:1, 0:1)
n = nrow(my_grid)
n = c(1:n)
my_grid$sequence = paste("sequence", n)
my_grid$seq = paste0(my_grid$Var1, my_grid$Var2, my_grid$Var3)
Var1 Var2 Var3 sequence seq
1 0 0 0 sequence 1 000
2 1 0 0 sequence 2 100
3 0 1 0 sequence 3 010
4 1 1 0 sequence 4 110
5 0 0 1 sequence 5 001
6 1 0 1 sequence 6 101
7 0 1 1 sequence 7 011
8 1 1 1 sequence 8 111
GOAL: Within the entire dataset, I want to find out the number of times each sequence appears (at the row level). For example, given that a student in this population failed two consecutive tests (e.g. failed tests 4&5, failed test 1&2) - what is the probability that such a student will also fail the next test?
I tried to approach this problem as follows: I took the exam scores of each student, concatenated them into a single string, and stored that string in a new column. This should make it easier to recognize a desired pattern:
my_list = list()
for (i in seq_len(nrow(my_data)))
{
val_i = paste(my_data[i,-1],collapse="")
print(val_i)
my_list[[i]] = val_i
}
my_data$cols <- my_list
my_fun <- function(seq, data){
return(lengths(gregexpr(seq, data)))
}
PROBLEM: Then, I tried to apply this function to obtain the final counts - but I am getting this error:
#PROBLEM
my_grid$counts = mapply(my_fun, c(my_grid$seq), my_data$cols)
Error in input[i, ] : incorrect number of dimensions
Ideally, I am looking for the final result to look something like this (from here, I could simply calculate the conditional probabilities):
# FINAL RESULT
Var1 Var2 Var3 sequence seq counts
1 0 0 0 sequence 1 000 ...
2 1 0 0 sequence 2 100 ...
3 0 1 0 sequence 3 010 ...
4 1 1 0 sequence 4 110 ...
5 0 0 1 sequence 5 001 ...
6 1 0 1 sequence 6 101 ...
7 0 1 1 sequence 7 011 ...
8 1 1 1 sequence 8 111 ...
QUESTION: Can someone please show me what I am doing wrong and what I can do to fix this?
Thanks!
- NOTE 1: Instead of using a function, I tried to do this with a for loop.
Here is the code I wrote:
my_list = list()
for (i in 1:length(my_grid$seq))
{
seq_i = my_grid$seq[i]
val_i = sum(lengths(gregexpr(seq_i, my_data$cols)))
print(c(i, seq_i, val_i))
}
[1] "1" "000" "11255"
[1] "2" "100" "12743"
[1] "3" "010" "12145"
[1] "4" "110" "12676"
[1] "5" "001" "12765"
[1] "6" "101" "12085"
[1] "7" "011" "12672"
[1] "8" "111" "11201"
But I don't think this is correct: the counts look rather high.
- NOTE 2: I am also trying to make sure that the conditional probabilities are calculated using individual students scores and not by "clumping" all student scores together.
E.g.
student 1 = 1,1,0,0,1,0,0
student 2 = 1,0,0,1,1,1,0
It would be incorrect to combine the scores of both of these students into a single string "1,1,0,0,1,0,0, 1,0,0,1,1,1,0"
and then calculate the frequency counts - I would like to calculate these counts at the student level and then add them up together.
CodePudding user response:
The issue may be that gregexpr also returns -1 when there is no match. lengths counts that -1 as one element, so every string without a match adds 1 to the total and sum reports an inflated count. If we change the function to
my_fun <- function(seq, data){
sum(lengths(lapply(gregexpr(seq, data), function(x) x[x != -1])))
}
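As a quick sanity check of that reasoning, here is a toy comparison. The two strings are made up for illustration and are not taken from my_data; only the first one contains "00".

```r
# Toy strings (hypothetical): only the first contains "00".
x <- c("0100", "1111")
m <- gregexpr("00", x)

# Naive count: a no-match entry is the vector -1, which still has length 1.
naive_count <- sum(lengths(m))

# Filtered count: drop the -1 marker before measuring the length.
fixed_count <- sum(lengths(lapply(m, function(p) p[p != -1])))

print(c(naive = naive_count, fixed = fixed_count))
```

The naive version reports 2 and the filtered version reports 1, so each string without a match contributes one spurious hit. my_fun above applies the same filter.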
Then we use this function as
library(dplyr)
my_grid %>%
rowwise %>%
mutate(counts = my_fun(seq, my_data$cols)) %>%
ungroup
# A tibble: 8 × 6
Var1 Var2 Var3 sequence seq counts
<int> <int> <int> <chr> <chr> <int>
1 0 0 0 sequence 1 000 6215
2 1 0 0 sequence 2 100 10018
3 0 1 0 sequence 3 010 8325
4 1 1 0 sequence 4 110 9939
5 0 0 1 sequence 5 001 10072
6 1 0 1 sequence 6 101 8274
7 0 1 1 sequence 7 011 9973
8 1 1 1 sequence 8 111 6097
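As for the original mapply call: mapply walks its vector arguments in parallel (recycling the shorter one), so it pairs the i-th sequence with the i-th student instead of testing every sequence against every student. I am not sure that alone explains the exact error message, but it would not give the intended counts either. A base R sketch that loops over the sequences only is shown below; cols and grid are small hypothetical stand-ins for my_data$cols and my_grid so the result can be checked by hand.

```r
# Hypothetical stand-ins for my_data$cols and my_grid.
cols <- c("1100100", "1001110", "01NANA")
grid <- expand.grid(Var1 = 0:1, Var2 = 0:1, Var3 = 0:1)
grid$seq <- paste0(grid$Var1, grid$Var2, grid$Var3)

# Same function as in the answer: drop the -1 "no match" marker, then count.
my_fun <- function(seq, data) {
  sum(lengths(lapply(gregexpr(seq, data), function(x) x[x != -1])))
}

# One call per sequence; each call scans every student string separately
# and adds up the per-student counts.
grid$counts <- vapply(grid$seq, my_fun, integer(1), data = cols,
                      USE.NAMES = FALSE)
print(grid)
```

Because gregexpr treats each element of its text argument on its own, a match can never span two students, which is what NOTE 2 in the question asks for.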
Even among the first 6 elements of my_data$cols, 2 have no match for the first sequence, and gregexpr returns -1 for them:
> gregexpr(my_grid$seq[1], head(my_data$cols))
[[1]]
[1] 5
attr(,"match.length")
[1] 3
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
[[2]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
[[3]]
[1] 5
attr(,"match.length")
[1] 3
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
[[4]]
[1] 1
attr(,"match.length")
[1] 3
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
[[5]]
[1] 4
attr(,"match.length")
[1] 3
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
[[6]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
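One further caveat that may matter for the conditional probabilities: gregexpr reports non-overlapping matches, so "0000" counts as one occurrence of "000" even though it contains two overlapping windows. If every window of three consecutive exams should count, a zero-width lookahead with perl = TRUE is one way to get that. The sketch below uses made-up strings and a hypothetical helper, count_matches, and assumes each string reads left to right in exam order.

```r
# Toy student strings, made up for illustration.
cols <- c("0000", "0001", "1100")

# Hypothetical helper: total matches over all strings, ignoring the -1 marker.
# Extra arguments (e.g. perl = TRUE) are passed through to gregexpr.
count_matches <- function(pattern, data, ...) {
  hits <- gregexpr(pattern, data, ...)
  sum(vapply(hits, function(p) sum(p != -1), integer(1)))
}

# Non-overlapping: "0000" contributes a single "000".
non_overlap_000 <- count_matches("000", cols)

# Overlapping: the lookahead matches at every start position of the pattern.
overlap_000 <- count_matches("(?=000)", cols, perl = TRUE)
overlap_001 <- count_matches("(?=001)", cols, perl = TRUE)

# Share of "two fails in a row" windows that are followed by a third fail.
p_fail_after_two_fails <- overlap_000 / (overlap_000 + overlap_001)

print(c(non_overlap_000 = non_overlap_000,
        overlap_000 = overlap_000,
        overlap_001 = overlap_001,
        p = p_fail_after_two_fails))
```

On these toy strings the non-overlapping count of "000" is 2 and the overlapping count is 3, and the estimated probability is 3 / (3 + 1) = 0.75. The same ratio, counts for "000" divided by the counts for "000" plus "001", can be read off the counts column once it is computed with whichever matching rule is intended.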