I am working with the R programming language.
I have the following R code:
library(tidyverse)
library(dplyr)
set.seed(123)
ids = 1:100
student_id = sample(ids, 1000, replace = TRUE)
coin_result = sample(c("H", "T"), 1000, replace = TRUE)
my_data = data.frame(student_id, coin_result)
my_data = my_data[order(my_data$student_id),]
I want to count the number of 3-flip sequences recorded by each student (e.g., if Student 1 got HHHTH, that is HHH 1 time, HHT 1 time, and HTH 1 time). I also want the probability of the third flip given the previous two flips (e.g., over all students, the probability of an H following HH was 0.54). Here is some R code that counts the sequences:
results = my_data %>%
  group_by(student_id) %>%
  summarize(Sequence = str_c(coin_result, lead(coin_result), lead(coin_result, 2)), .groups = 'drop') %>%
  filter(!is.na(Sequence)) %>%
  count(Sequence)
This code works fine on my home computer, but at school we have an older version of R and no internet connection, so I cannot update or install packages. When I run this code on the school computer, I get the following error:
Error: Column 'Sequence' must be length 1 (a summary value), not 12
My question: is there a different way to rewrite this code, or perhaps to write it in base R?
Thanks!
CodePudding user response:
If you want to get the same result in base R without any packages, you can do:
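setNames(as.data.frame(table(unlist(lapply(split(my_data, my_data$student_id),
  function(d) {
    # as.character() guards against coin_result being a factor (the data.frame default in R < 4.0)
    flips <- as.character(d$coin_result)
    # paste each flip with the next two, padding the end with "" so the vectors line up
    x <- paste0(flips, c(flips[-1], ""), c(flips[-(1:2)], "", ""))
    # keep only the complete 3-flip sequences
    x[nchar(x) == 3]
  })))), c("Sequence", "n"))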
#>   Sequence   n
#> 1      HHH 112
#> 2      HHT 106
#> 3      HTH 108
#> 4      HTT  93
#> 5      THH  97
#> 6      THT  97
#> 7      TTH  93
#> 8      TTT  94
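To see how the padding trick works, here is a small check on the sequence HHHTH from the question. The helper name three_flip_sequences is made up for this illustration and is not part of the original code.

```r
# Same logic as the anonymous function above, applied to a single vector of flips
three_flip_sequences <- function(flips) {
  flips <- as.character(flips)
  # shift by one and by two positions, padding the end with ""
  x <- paste0(flips, c(flips[-1], ""), c(flips[-(1:2)], "", ""))
  # windows that ran off the end are shorter than 3 characters, so drop them
  x[nchar(x) == 3]
}

three_flip_sequences(c("H", "H", "H", "T", "H"))
#> [1] "HHH" "HHT" "HTH"
```

This matches the example in the question: HHH once, HHT once, HTH once.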
The same approach may be a little easier to read in pipe format if you have R 4.1 or later, which introduced the native |> pipe:
f <- function(d) {
  x <- paste0(d, c(d[-1], ""), c(d[-(1:2)], "", ""))
  x[nchar(x) == 3]
}
my_data$coin_result |>
  split(my_data$student_id) |>
  lapply(f) |>
  unlist() |>
  table() |>
  as.data.frame() |>
  setNames(c("Sequence", "n"))
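For the second part of the question, the probability of the third flip given the previous two, one possible sketch builds on the same sequences: split each one into its first two flips and its last flip, then take row proportions. The names seqs, prev_two, third, and cond_prob are illustrative and not from the original post, and the code avoids the |> pipe so it should also run on older versions of R.

```r
# Recreate the example data (stringsAsFactors = FALSE keeps flips as character in R < 4.0)
set.seed(123)
student_id  <- sample(1:100, 1000, replace = TRUE)
coin_result <- sample(c("H", "T"), 1000, replace = TRUE)
my_data <- data.frame(student_id, coin_result, stringsAsFactors = FALSE)
my_data <- my_data[order(my_data$student_id), ]

# All complete 3-flip sequences, built per student with the same padding trick
seqs <- unlist(lapply(split(my_data$coin_result, my_data$student_id), function(d) {
  x <- paste0(d, c(d[-1], ""), c(d[-(1:2)], "", ""))
  x[nchar(x) == 3]
}), use.names = FALSE)

prev_two <- substr(seqs, 1, 2)   # first two flips of each sequence
third    <- substr(seqs, 3, 3)   # the flip that followed them

# Row proportions: estimated P(third flip | previous two flips)
cond_prob <- prop.table(table(prev_two, third), margin = 1)
round(cond_prob, 2)
```

Each row of cond_prob sums to 1, so cond_prob["HH", "H"] is the estimated probability of an H following HH.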
CodePudding user response:
There's probably a conflict between the filter function of stats and the one in the dplyr package that you have loaded in your workspace. Force your program to use dplyr::filter by doing
results = my_data %>%
  group_by(student_id) %>%
  summarize(Sequence = str_c(coin_result, lead(coin_result), lead(coin_result, 2)), .groups = 'drop') %>%
  dplyr::filter(!is.na(Sequence)) %>%
  count(Sequence)
In addition, you can investigate package conflicts in the future with the conflicted package:
library(conflicted)
conflict_scout()
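If the error remains after qualifying filter, another possible cause is worth checking. As far as I know, dplyr releases before 1.0.0 require each summarize() expression to return a single value per group, which matches the wording of the error message. A minimal sketch that avoids this, assuming dplyr and stringr are already installed on the older machine, replaces summarize() with mutate(), which keeps one row per flip:

```r
library(dplyr)
library(stringr)

set.seed(123)
student_id  <- sample(1:100, 1000, replace = TRUE)
coin_result <- sample(c("H", "T"), 1000, replace = TRUE)
# stringsAsFactors = FALSE keeps coin_result as character on R < 4.0
my_data <- data.frame(student_id, coin_result, stringsAsFactors = FALSE)
my_data <- my_data[order(my_data$student_id), ]

results <- my_data %>%
  group_by(student_id) %>%
  # mutate() returns one value per row, so no multi-row summary is needed
  mutate(Sequence = str_c(coin_result, lead(coin_result), lead(coin_result, 2))) %>%
  ungroup() %>%
  # str_c() returns NA when any piece is NA, i.e. for the last two flips of each student
  dplyr::filter(!is.na(Sequence)) %>%
  count(Sequence)

results
```

Because the data are grouped by student_id, lead() never looks past the last flip of a student, so no sequence spans two students.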
Hope this helps ;)