Simulating correlated answers to a multi-choice test-CodePudding

I am trying to simulate the answers to a multi-choice question test (MCQ). Currently, I am using the following code to simulate the answers to a MCQ with only two questions:

answers <- data.frame(
Q1 = sample(LETTERS[1:5],10,replace = T, prob=c(0.1,0.6,0.1,0.1,0.1)),
Q2 = sample(LETTERS[1:5],10,replace = T, prob=c(0.5,0.1,0.1,0.2,0.1)))

The answers B and A are, respectively, the correct answers to Q1 and Q2.

My difficulty is to introduce correlation among the answers to the questions, in the sense that, for instance, a good student tends to select the correct answer to all questions. How can I accomplish that?

CodePudding user response：

You could fill up the data with completely correct answers, assign a level of proficiency to each individual student and then randomly change values in their exams, depending on their proficiency:

correct = c(2,1,3)
nstudents = 20
exam = matrix(LETTERS[rep(correct,nstudents)],ncol=length(correct),byrow=T)
colnames(exam)=paste("Q",1:length(correct),sep="")

proficiency = runif(nstudents,1,5)/5 ## Each student has a level of expertise

for(question in 1:length(correct)){
  difficulty = runif(nstudents,1,10)/10  ## Random difficulty for each question and student (may be made more or less difficult)
  nmistakes = sum(proficiency<difficulty)
  exam[,question][proficiency<difficulty] = sample(LETTERS[1:5],nmistakes,replace=T)
}

exam = as.data.frame(exam)

The result would be a data frame in which some students hardly ever make mistakes while others hardly ever get something right.

EDIT: The proficiency, in this case, follows an uniform distribution. If you need them normally distributed, just change the proficiency vector to use rnorm().

CodePudding user response：

Here is a method that applies a covariance matrix Sigma= using MASS::mvrnorm.

n <- 15
r <- .9
set.seed(42)
library('MASS')
M <- abs(mvrnorm(n=n, mu=c(1, 500), Sigma=matrix(c(1, r, r, 1), nrow=2), 
                empirical=TRUE)) |>
  as.data.frame() |>
  setNames(c('Q1', 'Q2'))

We get the correlated levels A, ..., B by cutting the random numbers along custom quantiles (taken from OP),

f <- \(x, q) cut(x, breaks=c(0, quantile(x, cumsum(q))), include.lowest=T, 
                 labels=LETTERS[1:5])

p1 <- c(0.1, 0.6, 0.1, 0.1, 0.1)
p2 <- c(0.5, 0.1, 0.1, 0.2, 0.1)

in a Map() call.

dat <- Map(f, M, list(p1, p2)) |>
  as.data.frame()
dat
#    Q1 Q2
# 1   A  A
# 2   B  A
# 3   E  E
# 4   D  E
# 5   A  A
# 6   B  A
# 7   C  D
# 8   B  A
# 9   B  A
# 10  B  A
# 11  B  C
# 12  B  B
# 13  E  D
# 14  B  A
# 15  C  D

Check

dat_check <- lapply(dat, as.integer) |> as.data.frame()
cor(dat_check)  ## correlation
#         Q1      Q2
# Q1 1.00000 0.85426
# Q2 0.85426 1.00000

lapply(dat, table)  ## students' answers
# $Q1
# 
# A B C D E 
# 2 8 2 1 2 
# 
# $Q2
# 
# A B C D E 
# 8 1 1 3 2