R: Converting Tidyverse into Base R

Time:02-01

I am working with the R programming language.

I have the following R code:

library(tidyverse)
library(dplyr)

set.seed(123)
ids = 1:100
student_id = sample(ids, 1000, replace = TRUE)
coin_result = sample(c("H", "T"), 1000, replace = TRUE)
my_data = data.frame(student_id, coin_result)

my_data =  my_data[order(my_data$student_id),]

I want to do two things. First, count the number of "3 Flip Sequences" recorded by each student (e.g. if Student 1 got HHHTH, that is HHH 1 time, HHT 1 time, HTH 1 time). Second, compute the probability of the 3rd flip given the previous 2 flips (e.g. over all students, the probability of an H following HH was 0.54). Here is some R code that counts the sequences:

results = my_data %>%
  group_by(student_id) %>%
  summarize(Sequence = str_c(coin_result, lead(coin_result), lead(coin_result, 2)), .groups = 'drop') %>%
  filter(!is.na(Sequence)) %>%
  count(Sequence)

This code works fine on my home computer, but at school we have an older version of R and no internet connection, so I cannot update or install packages. When I run this code on the school computer, I get the following error:

Error: Column 'Sequence' must be length 1 (a summary value), not 12

My Question: I am trying to find a different way to re-write this code - or perhaps write it in Base R.

Thanks!

CodePudding user response:

If you want to get the same result in base R without any packages, you can do:

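setNames(as.data.frame(table(unlist(lapply(split(my_data, my_data$student_id),
       function(d) {
         # as.character() guards against coin_result being a factor,
         # which is the data.frame() default before R 4.0
         cr <- as.character(d$coin_result)
         # paste each flip with the next two; the tail is padded with ""
         x <- paste0(cr, c(cr[-1], ""), c(cr[-(1:2)], "", ""))
         # keep only complete 3-flip sequences
         x[nchar(x) == 3]
         })))), c("Sequence", "n"))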
#>   Sequence   n
#> 1      HHH 112
#> 2      HHT 106
#> 3      HTH 108
#> 4      HTT  93
#> 5      THH  97
#> 6      THT  97
#> 7      TTH  93
#> 8      TTT  94

This may be a little easier to understand in pipe format if you have a newer version of R (the native pipe |> needs R 4.1.0 or later):

f <- function(d) {
  x <- paste0(d, c(d[-1], ""), c(d[-(1:2)], "", ""))
  x[nchar(x) == 3]
}

my_data$coin_result         |>
  split(my_data$student_id) |>
  lapply(f)                 |>
  unlist()                  |>
  table()                   |>
  as.data.frame()           |>
  setNames(c("Sequence", "n"))
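
The question also asks for the probability of the 3rd flip given the previous 2 flips, which the counts above do not give directly. Here is a minimal base R sketch of one way to get it, assuming the same my_data as in the question. The helper name triples and the table labels previous and third are made up for illustration:

```r
# Rebuild the example data from the question.
set.seed(123)
student_id  <- sample(1:100, 1000, replace = TRUE)
coin_result <- sample(c("H", "T"), 1000, replace = TRUE)
my_data <- data.frame(student_id, coin_result, stringsAsFactors = FALSE)
my_data <- my_data[order(my_data$student_id), ]

# Hypothetical helper: all overlapping 3-flip windows for one student.
triples <- function(flips) {
  n <- length(flips)
  if (n < 3) return(character(0))
  paste0(flips[1:(n - 2)], flips[2:(n - 1)], flips[3:n])
}

# Build the windows per student so no window spans two students.
seqs <- unlist(lapply(split(my_data$coin_result, my_data$student_id), triples),
               use.names = FALSE)

# Cross-tabulate the first two flips against the third flip,
# then divide each row by its total to get conditional probabilities.
counts <- table(previous = substr(seqs, 1, 2), third = substr(seqs, 3, 3))
probs  <- prop.table(counts, margin = 1)
round(probs, 2)
```

Each row of probs corresponds to one pair of previous flips (HH, HT, TH, TT) and sums to 1, so the entry in row HH, column H estimates the probability of an H following HH.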

CodePudding user response:

There's probably a conflict between the filter functions of the stats and dplyr packages that you have loaded in your workspace. Force your program to use dplyr::filter by doing:

results = my_data %>%
  group_by(student_id) %>%
  summarize(Sequence = str_c(coin_result, lead(coin_result), lead(coin_result, 2)), .groups = 'drop') %>%
  dplyr::filter(!is.na(Sequence)) %>%
  count(Sequence)

To add to this, you can investigate package conflicts in the future by using:

library(conflicted)
conflict_scout()
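
If forcing dplyr::filter does not change the error, another possibility is worth checking. The error text mentions "a summary value" of length 1, which suggests (an assumption, since the question does not state the installed versions) that the school machine has a dplyr release older than 1.0.0, where summarize() had to return exactly one value per group. Under that assumption, a rough sketch that avoids the multi-row summary by using mutate() instead, and drops the .groups argument, could look like this:

```r
library(dplyr)
library(stringr)

# Rebuild the example data from the question.
set.seed(123)
student_id  <- sample(1:100, 1000, replace = TRUE)
coin_result <- sample(c("H", "T"), 1000, replace = TRUE)
my_data <- data.frame(student_id, coin_result, stringsAsFactors = FALSE)
my_data <- my_data[order(my_data$student_id), ]

results <- my_data %>%
  group_by(student_id) %>%
  # mutate() returns one value per row, so no multi-row summary is needed;
  # str_c() yields NA where lead() runs past the end of a student's flips
  mutate(Sequence = str_c(coin_result, lead(coin_result), lead(coin_result, 2))) %>%
  ungroup() %>%
  dplyr::filter(!is.na(Sequence)) %>%
  count(Sequence)

results
```

The stringsAsFactors = FALSE argument is added here so that coin_result stays a character column on R versions before 4.0.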

Hope this helps ;)
