I am working with the R programming language.
I have this dataset over here - different students flip a coin a different number of times:
set.seed(123)
ids = 1:100
student_id = sample(ids, 1000, replace = TRUE)
coin_result = sample(c("H", "T"), 1000, replace = TRUE)
my_data = data.frame(student_id, coin_result)
my_data = my_data[order(my_data$student_id),]
Based on this data, I want to count how often each 3-flip sequence (e.g. "HHH", "HHT") occurs for each student.
I know how to do this for the entire dataset at once:
# https://stackoverflow.com/questions/74758896/r-counting-the-frequencies-of-coin-flips
results = my_data$coin_result
n_sequences <- function(n, results) {
  # Recursively paste together the n flips starting at position i
  helper <- function(i, n) if (n < 1) "" else sprintf(
    "%s%s",
    helper(i, n - 1),
    results[i + n - 1]
  )
  # Tabulate every window of length n
  result <- data.frame(
    table(
      sapply(
        1:(length(results) - n + 1),
        function(i) helper(i, n)
      )
    )
  )
  colnames(result) <- c("Sequence", "Frequency")
  result
}
n_sequences(3, results)
Sequence Frequency
1 HHH 140
2 HHT 129
3 HTH 132
4 HTT 119
5 THH 129
6 THT 121
7 TTH 119
8 TTT 109
Now I am trying to perform the same calculation, but per student and then summed over all students. That is, I want the "counter" to restart every time a new student starts flipping the coin, so that no sequence spans two students. This would let me find the total number of times "HHH" (for example) appears when each student is counted separately.
I thought of a very slow and inefficient way to do this:
library(dplyr)
my_list = list()
for (i in 1:length(unique(ids))) {
  tryCatch({
    # Flips for student i only, so sequences never span two students
    frame_i = my_data[my_data$student_id == i,]
    results_i = frame_i$coin_result
    counts_i = n_sequences(3, results_i)
    final_i = cbind(student_id = i, counts_i)
    my_list[[i]] = final_i
    #print(final_i)
  }, error = function(e) {})
}
goal = do.call(rbind.data.frame, my_list)
# EXPECTED OUTPUT
summary = goal %>% group_by(Sequence) %>% summarise(sums = sum(Frequency))
> summary
# A tibble: 8 x 2
Sequence sums
<fct> <int>
1 HTT 93
2 TTH 93
3 HHH 112
4 HHT 106
5 HTH 108
6 THH 97
7 TTT 94
8 THT 97
Even if my approach is correct, I have a feeling that running this loop on big datasets (e.g. when there are over 1 million distinct student_id values) will take a long time.
Can someone please suggest a more efficient way to solve this problem?
Thanks!
Note: I am not sure the n_sequences() function works if any student in the data frame has fewer than "n" flips, e.g. n_sequences(n = 5, results). This is why I added a tryCatch() statement to skip such cases.
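To make the short-student case explicit instead of relying on tryCatch(), one option is to check the length up front. This is only a rough sketch: the name n_sequences_safe is hypothetical, and it pastes each window directly rather than using the recursive helper.

```r
# Hypothetical guarded variant of n_sequences(): returns zero rows when
# there are fewer than n flips, instead of erroring or counting garbage.
n_sequences_safe <- function(n, results) {
  len <- length(results)
  if (len < n) {
    return(data.frame(Sequence = character(0), Frequency = integer(0)))
  }
  starts <- seq_len(len - n + 1)
  # Paste the n flips of each window into one string
  windows <- vapply(
    starts,
    function(i) paste(results[i:(i + n - 1)], collapse = ""),
    character(1)
  )
  result <- as.data.frame(table(windows), stringsAsFactors = FALSE)
  colnames(result) <- c("Sequence", "Frequency")
  result
}

n_sequences_safe(5, c("H", "T"))            # zero rows, no error
n_sequences_safe(3, c("H", "H", "H", "T"))  # HHH: 1, HHT: 1
```

The guard matters because 1:(length(results) - n + 1) counts downwards when the vector is shorter than n, so the original loop bounds are not empty in that case.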
CodePudding user response:
Here's some dplyr code:
library(tidyverse)
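my_data %>%
  group_by(student_id) %>%
  # lead() is applied within each student, so windows never cross students.
  # Note: dplyr >= 1.1.0 warns about multi-row summarize(); reframe() is the newer verb for this.
  summarize(Sequence = str_c(coin_result, lead(coin_result), lead(coin_result, 2)), .groups = 'drop') %>%
  # Windows that run past a student's last flip are NA and get dropped
  filter(!is.na(Sequence)) %>%
  count(Sequence)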
# A tibble: 8 x 2
Sequence n
<chr> <int>
1 HHH 112
2 HHT 106
3 HTH 108
4 HTT 93
5 THH 97
6 THT 97
7 TTH 93
8 TTT 94
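If the window length needs to vary, the same per-student idea can be written without hard-coding the two lead() calls. The following is a sketch in base R, assuming the rows are already in flip order within each student; count_sequences_by_student is a made-up name, not a function from any package.

```r
# Hypothetical helper: count length-n sequences within each student,
# then pool the counts over all students.
count_sequences_by_student <- function(data, n) {
  per_student <- split(data$coin_result, data$student_id)
  windows <- unlist(lapply(per_student, function(flips) {
    len <- length(flips)
    if (len < n) return(character(0))  # too few flips: contributes nothing
    # Element k of `parts` holds the k-th flip of every window
    parts <- lapply(seq_len(n), function(k) flips[k:(len - n + k)])
    do.call(paste0, parts)
  }), use.names = FALSE)
  result <- as.data.frame(table(windows), stringsAsFactors = FALSE)
  colnames(result) <- c("Sequence", "Frequency")
  result
}

# Toy data: student 1 flips H H H T, student 2 flips T H H
toy <- data.frame(
  student_id  = c(1, 1, 1, 1, 2, 2, 2),
  coin_result = c("H", "H", "H", "T", "T", "H", "H")
)
count_sequences_by_student(toy, 3)  # HHH: 1, HHT: 1, THH: 1
```

In the toy data, "HTT" and "TTH" do not appear because they would span the boundary between the two students. Since the windows are built with vectorised paste0() calls per student rather than one call per window, this may scale better than the loop in the question, though I have not benchmarked it.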