Home > Software engineering >  Calculating the probability that a student fails two consecutive exams?
Calculating the probability that a student fails two consecutive exams?

Time:12-14

I am working with the R programming language. I have the following dataset - students take an exam multiple times, they either pass ("1") or fail ("0"). The data looks something like this:

id = sample.int(10000, 100000, replace = TRUE)
res = c(1,0)
results = sample(res, 100000, replace = TRUE)
date_exam_taken = sample(seq(as.Date('1999/01/01'), as.Date('2020/01/01'), by="day"), 100000, replace = TRUE)

      id results date_exam_taken exam_number
7992   1       1      2004-04-23           1
24837  1       0      2004-12-10           2
12331  1       1      2007-01-19           3
34396  1       0      2007-02-21           4
85250  1       0      2007-09-26           5
11254  1       1      2009-12-20           6

my_data = data.frame(id, results, date_exam_taken)
my_data <- my_data[order(my_data$id, my_data$date_exam_taken),]

my_data$general_id = 1:nrow(my_data)
my_data$exam_number = ave(my_data$general_id, my_data$id, FUN = seq_along)
my_data$general_id = NULL

I was interested in finding out - suppose a student fails an exam, what is the probability that this student fails the next exam? (e.g. Fails the 1st exam - what is the probability of failing the 2nd exam? Fails the 5th exam - what is the probability of failing the 6th exam?). I wrote the following loop to answer this question:

my_list = list()

for (i in 1:length(unique(my_data$id)))
    
{ 
    {tryCatch({
        
        start_i = my_data[my_data$id == i,]
        
        pairs_i =  data.frame(first = head(start_i$results, -1), second = tail(start_i$results, -1))
        frame_i =  as.data.frame(table(pairs_i))
        frame_i$id = i
        print(frame_i)
        my_list[[i]] = frame_i
    }, error = function(e){})
    }}


 final = do.call(rbind.data.frame, my_list)
   #################################################



library(dplyr)
total_1 = final %>% group_by(first, second) %>% summarise(totals = n())
total_2 = total_1 %>% group_by(first) %>% summarise(sum = sum(totals))

join = merge(x = total_1, y = total_2, by = "first", all = TRUE)
join$probs = join$totals/join$sum

The final answer looks something like this:

  first second totals   sum     probs
1     0      0   9817 19557 0.5019686
2     0      1   9740 19557 0.4980314
3     1      0   9727 19498 0.4988717
4     1      1   9771 19498 0.5011283

Now, I am trying to modify the above code so that the analysis is performed at the "Second Level" - that is, I want to find out the probability of a student failing the next exam given that the student failed the two previous exams? As an example, supposed the student failed the 3rd and the 4th exam - what is the probability that a student will fail the 5th exam?

I think that the final answer should look something like this:

# note: "first" and "second" refer to any consecutive exams (not the literal first and second exam), and "third" refers to the exam occurring immediately after the "second" exam

  first second third totals sums probs
1     1      1     1    ...  ...   ...
2     0      0     0    ...  ...   ...
3     1      0     1    ...  ...   ...
4     0      1     0    ...  ...   ...
5     0      0     1    ...  ...   ...
6     1      1     0    ...  ...   ...
7     0      1     1    ...  ...   ...
8     1      0     0    ...  ...   ...

I tried to manually modify my code for these requirements:

library(stringr)
my_list = list()

for (i in 1:length(unique(my_data$id)))

{ 
    {tryCatch({


start_i = my_data[my_data$id == i,]

vals_i = as.numeric(paste(start_i$results, collapse = ""))
L_1_i = lengths(gregexpr("111", vals_i))
L_2_i = lengths(gregexpr("000", vals_i))
L_3_i = lengths(gregexpr("101", vals_i))
L_4_i = lengths(gregexpr("010", vals_i))
L_5_i = lengths(gregexpr("001", vals_i))
L_6_i = lengths(gregexpr("110", vals_i))
L_7_i = lengths(gregexpr("011", vals_i))
L_8_i = lengths(gregexpr("100", vals_i))

frame_i = data.frame(class = c("111","000","101","010","001","110","011","100"), values = c(L_1_i, L_2_i, L_3_i, L_4_i, L_5_i, L_6_i, L_7_i, L_8_i))
frame_i$id = i
print(frame_i)


        my_list[[i]] = frame_i
    }, error = function(e){})
    }}


final = do.call(rbind.data.frame, my_list)


final$first = substr(final$class, 1,1)
final$second = substr(final$class, 2,2)
final$third = substr(final$class, 3,3)


total_1 = final %>% group_by(first, second, third) %>% summarise(totals = n())
total_2 = total_1 %>% group_by(first, second) %>% summarise(sum = sum(totals))

join = merge(x = total_1, y = total_2, by = c("first", "second"), all = TRUE)
join$probs = join$totals/join$sum

As an example - to answer the original question, we would now query the results (assuming I did this correctly). Here is the probability of a student passing an exam given that the student failed the previous two exams:

 first second third totals   sum probs
1     0      0     0  10000 20000   0.5
2     0      0     1  10000 20000   0.5
3     0      1     0  10000 20000   0.5
4     0      1     1  10000 20000   0.5
5     1      0     0  10000 20000   0.5
6     1      0     1  10000 20000   0.5
7     1      1     0  10000 20000   0.5
8     1      1     1  10000 20000   0.5

join[join$first == 0 & join$second ==0, & join$third == 1,]

I am not sure if what I have done is correct - can someone please help me correct my code (and possibly show me a "better" way to do this)?

Note: These probabilities look two "uniform" and make me inclined to think I did something wrong...

CodePudding user response:

P(A|B) = P(B|A)*P(A)/P(B)

Where A is the probability of failing exam n and B is the probability of failing exams n-1 and n-2.

We can write a function to calculate P(A|B)

d <- my_data
d %>% 
  dplyr::pivot_wider(id, names_from = "exam_number", values_from = "results") -> dd

p_fail_given_failfail <- function(n){ #n is integer > 2
  #check if the student took the exam n. Remove them if they didn't.
  indx.na <- is.na(dd[, n 1])
  dd <- dd[!indx.na, ]
  
  #Calculate probabilities
  p_BA <- nrow(dd[dd[, n 1]==0 & dd[, n]==0 & dd[, n-1]==0, ]) /nrow(dd[dd[, n 1]==0, ])
  p_A <- nrow(dd[dd[, n 1]==0, ]) / nrow(dd)
  p_B <- nrow(dd[dd[, n]==0 & dd[, n-1]==0, ]) / nrow(dd)

  p_AB <- p_BA*p_A/p_B
  return(p_AB)
}

p_fail_given_failfail(3) #prob of failing exam3, given failed exam2 and exam1
#0.5084479
  • Related