Problem and explanation of the expected output
I have raw data of how many attempts each student had in an exam and which items they responded (input
). For this example, I have a pool of 5 items, but students only respond to 3 of those items. They can take the exam multiple times. Hence, I would like to summarize the raw data in a way that I can see all info below:
- Trial number (e.g., 1st trial, 2nd trial, 3rd trial...).
trial
column - How many correct items (e.g., total correct items) How many total items they've responded to (in this case, it is supposed to be 3 for everyone),
- Percent of correct.
percent
column - Main problem: How many of the items had they seen before?
repeated
column (e.g., if they responded to item 1 in the first trial and also responded to item 1 in the second trial, then I want to know this info)- If there are repeated items, I want to see which repeated
items were there.
repeated_items
column
- If there are repeated items, I want to see which repeated
items were there.
Input
As input
, I have the raw data, with students name
, the number of the trial
(e.g., 1, 2, 3...), and all items from the pool i1
:i5
. The values for the items (0, 1) shows whether the student got it right or wrong (1 = correct, 0 = wrong). And missing values show that the student didn't take that item on that attempt.
library(dplyr)
df <- tibble(name = c("John", "John", "Mary", "Mary"),
trial = c(1, 2, 1, 2),
i1 = c(1, 0, 0, NA),
i2 = c(NA, NA, 1, 1),
i3 = c(NA, 1, NA, 1),
i4 = c(0, 1, 1, NA),
i5 = c(0, NA, NA, 1))
# # A tibble: 4 × 7
# name trial i1 i2 i3 i4 i5
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 John 1 1 NA NA 0 0
# 2 John 2 0 NA 1 1 NA
# 3 Mary 1 0 1 NA 1 NA
# 4 Mary 2 NA 1 1 NA 1
Expected Output
As output
I want to see a summary table like this one below, showing all points I've put in the description above. I've created this table manually, and now I'm trying to organize it with functions, so I can speed up this process. My main problem is to generate columns repeated
and repeated_items
. The first trial of each students will always have NA
for repeated_items
and zero for repeated
given this is the first trial, so they've never seen those items before.
repeated
column counts how many items the participant had already seen in a previous attempt. So for instance, in John's first attempt, he responded to items i1, i4, and i5. In his second attempt, he responded to items i1, i3, and i4. Hence, in his second attempt, he responded to two items that were previously seen (i1, and i4). So I want to use the columnrepeated
to count how many items were repeated from the previous attempt.repeated_item
column tracks which specific items were "repeated" in this new trial. So, in John's case, in the second trial, items i1 and i4 were repeated. And in Mary's second trial, only item i2 was repeated when compared to her first trial.# # A tibble: 4 × 7 # name trial correct n_items percent repeated repeated_items # <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> # 1 John 1 1 3 0.333 0 NA # 2 John 2 2 3 0.667 2 i1, i4 # 3 Mary 1 2 3 0.667 0 NA # 4 Mary 2 3 3 1 1 i2
CodePudding user response:
We could use
library(dplyr)
library(stringr)
library(tidyr)
df %>%
pivot_longer(cols = starts_with("i"),
names_to = "i",values_drop_na = TRUE) %>%
add_count(name, i, name = "n1") %>%
mutate(i =replace(i, n1 <2, NA_character_)) %>%
group_by(name, trial) %>%
summarise(correct = sum(value), n_items = n(),
percent = mean(value), repeated= n_distinct(i, na.rm = TRUE),
repeated_items = str_c(i[!is.na(i)], collapse = ", "),
.groups = 'drop_last') %>%
mutate(across(starts_with('repeated'), ~ replace(.x, 1, NA)))
-output
# A tibble: 4 × 7
# Groups: name [2]
name trial correct n_items percent repeated repeated_items
<chr> <dbl> <dbl> <int> <dbl> <int> <chr>
1 John 1 1 3 0.333 NA <NA>
2 John 2 2 3 0.667 2 i1, i4
3 Mary 1 2 3 0.667 NA <NA>
4 Mary 2 3 3 1 1 i2
CodePudding user response:
library(tidyverse)
df %>%
mutate(correct = rowSums(across(starts_with("i")), na.rm = T),
n_items = rowSums(!is.na(across(starts_with("i")))),
percent = correct / n_items) %>%
group_by(name) %>%
mutate(repeated_items = across(starts_with("i")) %>%
imap_chr(~ ifelse(!is.na(diff(.)), .y, NA)) %>%
na.omit() %>%
str_flatten(","),
repeated_items = ifelse(row_number() == 1, NA, repeated_items),
repeated = replace_na(str_count(repeated_items, ",") 1, 0)) %>%
ungroup()
Output
name trial i1 i2 i3 i4 i5 correct n_items percent repeated_items repeated
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 John 1 1 NA NA 0 0 1 3 0.333 NA 0
2 John 2 0 NA 1 1 NA 2 3 0.667 i1,i4 2
3 Mary 1 0 1 NA 1 NA 2 3 0.667 NA 0
4 Mary 2 NA 1 1 NA 1 3 3 1 i2 1