In R, how do I make a long dataset that contains all the categories that a row applies to?-CodePudding

I have a dataset of students, and I want to create a column that categorizes each student based on their and other students' data in their "subject" and "grade_letter" column because I eventually want to create a column that says "math_science_A", "science_B", etc. to know how many students got As, Bs, Cs, etc. based on the unique make ups.

I have a list of all potential combinations that a student can have, but when I try to iterate over R, my case_when only gives me the first category. I then tried to make multiple columns, so I could pivot longer later, but that also seemed to fail. Any help would be appreciated!

#Here's my sample dataset
library(tidyverse)
test <- tibble(student_id = c(1),
               subject = c("math"),
               grade_num = c(95),
               grade = c("a"))

#Here's a sample of what I want (full desired output would be too long to write; see text below code chunks)
sample_of_desired_output <- tibble(student_id = c(1),
               subject = c("math"),
               grade_num = c(95),
               grade = c("a"),
category = c("n_math_a", "n_science_a", "n_english_a", "n_math_science_a", "n_math_english_a"))

#####Here's what I tried
#Getting all combinations--this part works
all_subjects <- c("math", "science", "english")

i = 1
while(i <= length(all_subjects)) {
  
  if(i == 1) {
    all_subject_combos <- c()
  } else {
    some_combos <- combn(all_subjects, i,  FUN = function(x) paste(x, collapse = "_"), simplify = TRUE)
    all_subject_combos <- append(all_subject_combos, some_combos)
  }
  
  i <- i  1
}

all_grades <- c("A", "B", "C", "D", "F")

all_subjects_and_grades <- expand.grid(all_subject_combos, all_grades)

subjects_and_grades_combos <- all_subjects_and_grades %>%
  unite("names", c(Var1, Var2)) %>%
  mutate(names =  tolower(paste0("n_", str_replace_all(names, "\\|", "_")))) %>%
  pull(names)


#This part does not work but is what I tried:

i = 1
while(i <= length(subjects_and_grades_combos)) {
  does_not_work <- test %>%
    mutate("subject_grade_grouping_{{i}}" := case_when(
      str_detect(subjects_and_grades_combos[i], subject) ~ subjects_and_grades_combos[i]))
  i <- i  1
}
does_not_work

Note that in the final, there will be all potential combinations, so each student will have their data repeating many times. After I have this, I will be using this to calculate the number of students who fall into each category. So, in this example, math_a, math_science_a, math_english_a, and math_science_english_a would all have 1. Note that I do not need help calculating that part, and I am flexible with the order of the columns.

CodePudding user response：

The while loops needs to update the original object or else it gets the last iteration update

library(dplyr)
library(tidyr)
test1 <- test
 i = 1
while(i <= length(subjects_and_grades_combos)) {
  test1 <- test1 %>%
     mutate("subject_grade_grouping_{{i}}" := case_when(
       str_detect(subjects_and_grades_combos[i], subject) ~ subjects_and_grades_combos[i]))
   i <- i  1
 }

test1 %>% 
  pivot_longer(cols = starts_with('subject_grade'),
   names_to = NULL, values_to = "category", values_drop_na = TRUE)

-output

# A tibble: 15 × 5
   student_id subject grade_num grade category                
        <dbl> <chr>       <dbl> <chr> <chr>                   
 1          1 math           95 a     n_math_science_a        
 2          1 math           95 a     n_math_english_a        
 3          1 math           95 a     n_math_science_english_a
 4          1 math           95 a     n_math_science_b        
 5          1 math           95 a     n_math_english_b        
 6          1 math           95 a     n_math_science_english_b
 7          1 math           95 a     n_math_science_c        
 8          1 math           95 a     n_math_english_c        
 9          1 math           95 a     n_math_science_english_c
10          1 math           95 a     n_math_science_d        
11          1 math           95 a     n_math_english_d        
12          1 math           95 a     n_math_science_english_d
13          1 math           95 a     n_math_science_f        
14          1 math           95 a     n_math_english_f        
15          1 math           95 a     n_math_science_english_f

CodePudding user response：

If you're just looking for how many students got each grade for each class:

library(dplyr); library(tidyr)
test %>%
  count(subject, grade) %>%
  complete(subject = c("math", "science", "english"),
           grade = c("a", "b", "c", "d", "f"), fill = list(n=0)) %>%
  pivot_wider(names_from = c(subject, grade), values_from = n)

# A tibble: 1 × 15
  english_a english_b english_c english_d english_f math_a math_b math_c math_d math_f science_a scienc…¹ scien…² scien…³ scien…⁴
      <int>     <int>     <int>     <int>     <int>  <int>  <int>  <int>  <int>  <int>     <int>    <int>   <int>   <int>   <int>
1         0         0         0         0         0      1      0      0      0      0         0        0       0       0       0
# … with abbreviated variable names ¹science_b, ²science_c, ³science_d, ⁴science_f

It sounds like you also want combinations of some sort, but it's not clear to me what you mean by that. Are we looking for the number of students who got an A in both math and science, or an A at least one of math and science? The sample data and output doesn't give a sense of what that should look like.