In R, how can I get the sum for all combinations of two variables?-CodePudding

I have a long dataset with students grades and subjects. I want to keep a long dataset, but I want to add a column that tells me how many Fs a student had in their humanity courses (English & History) and their STEM courses (biology & math). I also want the same for Ds, Cs, Bs, and As.

I know I could explicitly spell this out, but in the future, they might have other subjects (like adding Chemistry to STEM) or completely different categories, like Foreign Languages, so I want it to be scalable.

I know how to get all combinations of columns, and I know how to do to each part manually--but I don't know how to combine the two. Any help would be greatly appreciated!

#Sample data
library(tidyverse)

student_grades <- tibble(student_id = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5),
                      subject = c(rep(c("english", "biology", "math", "history"), 4), NA, "biology"),
                      grade = c(1, 2, 3, 4, 5, 4, 3, 2, 2, 4, 1, 1, 1, 1, 2, 3, 3, 4))

#All combinations of grades and subjects
all_subject_combos <- c("eng|his", "bio|math")
all_grades <- c("F", "D", "C", 
             "B", "A")

subjects_and_letter_grades <- expand.grid(all_subject_combos, all_grades)

all_combos <- subjects_and_letter_grades %>%
  unite("names", c(Var1, Var2)) %>%
  mutate(names = str_replace_all(names, "\\|", "_")) %>%
  pull(names)

#Manual generation of numbers of Fs by subject
#This is what I want the results to look like, but with all other letter grades

student_grades %>%
  group_by(student_id) %>%
  mutate(eng_his_F = sum((case_when(
    str_detect(subject, "eng|his") & grade == 1 ~ 1, 
    TRUE ~ 0)), na.rm = TRUE),
bio_math_F = sum((case_when(
  str_detect(subject, "bio|math") & grade == 1 ~ 1, 
  TRUE ~ 0)), na.rm = TRUE)) %>%
ungroup()

Ideally, this would be scalable for any number of subject combos and wouldn't require me to write out the same code for Ds, Cs, Bs, and As. Thank you!

CodePudding user response：

We may use

library(dplyr)
library(purrr)
library(stringr)
map_dfc(all_combos, ~ student_grades %>% 
  group_by(student_id) %>%
  transmute(!! .x := sum(case_when(str_detect(subject,
   str_replace(.x, "(\\w )_(\\w )_.", "\\1|\\2")) &
    grade == match(str_extract(.x, ".$"), all_grades)~ 1, TRUE ~ 0))) %>%
  ungroup %>% 
  dplyr::select(-student_id)) %>%
  bind_cols(student_grades, .)

-output

# A tibble: 18 × 13
   student_id subject grade eng_his_F bio_math_F eng_his_D bio_math_D eng_his_C bio_math_C eng_his_B bio_math_B eng_hi…¹ bio_m…²
        <dbl> <chr>   <dbl>     <dbl>      <dbl>     <dbl>      <dbl>     <dbl>      <dbl>     <dbl>      <dbl>    <dbl>   <dbl>
 1          1 english     1         1          0         0          1         0          1         1          0        0       0
 2          1 biology     2         1          0         0          1         0          1         1          0        0       0
 3          1 math        3         1          0         0          1         0          1         1          0        0       0
 4          1 history     4         1          0         0          1         0          1         1          0        0       0
 5          2 english     5         0          0         1          0         0          1         0          1        1       0
 6          2 biology     4         0          0         1          0         0          1         0          1        1       0
 7          2 math        3         0          0         1          0         0          1         0          1        1       0
 8          2 history     2         0          0         1          0         0          1         0          1        1       0
 9          3 english     2         1          1         1          0         0          0         0          1        0       0
10          3 biology     4         1          1         1          0         0          0         0          1        0       0
11          3 math        1         1          1         1          0         0          0         0          1        0       0
12          3 history     1         1          1         1          0         0          0         0          1        0       0
13          4 english     1         1          1         0          1         1          0         0          0        0       0
14          4 biology     1         1          1         0          1         1          0         0          0        0       0
15          4 math        2         1          1         0          1         1          0         0          0        0       0
16          4 history     3         1          1         0          1         1          0         0          0        0       0
17          5 <NA>        3         0          0         0          0         0          0         0          1        0       0
18          5 biology     4         0          0         0          0         0          0         0          1        0       0
# … with abbreviated variable names ¹eng_his_A, ²bio_math_A

CodePudding user response：

Here's another way of looking at it. I use a small mapping table (subject_to_field) which maps the subject to it's field (english -> humanities, math -> STEM etc.). I thought that this may help scalability. You need to maintain this table as subjects are added or removed.

The left_join then combines the field with the student_grades tibble.

Adding the column "grade2" is not needed but improves readability. Finally, all we need to to do is to perform the appropriate grouping and counting. In this approach you will not get a zero tally for grades that do not occur for a student.

library(tidyverse)

student_grades <- tibble(student_id = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5),
                         subject = c(rep(c("english", "biology", "math", "history"), 4), NA, "biology"),
                         grade = c(1, 2, 3, 4, 5, 4, 3, 2, 2, 4, 1, 1, 1, 1, 2, 3, 3, 4))

student_grades <- student_grades %>%
  mutate(grade2 = case_when(
    grade == 1 ~ "A",
    grade == 2 ~ "B", 
    grade == 3 ~ "C", 
    grade == 4 ~ "D", 
    grade == 5 ~ "F"))

subject_to_field <- tibble(
  subject = c("biology", "english", "history", "math"),
  field = c("STEM", "Humanities", "Humanities", "STEM")
)

student_grades <- student_grades %>%
  left_join(subject_to_field, by = c("subject" = "subject"))


student_summary <- student_grades %>%
  group_by(student_id, field, subject, grade2) %>%
  summarise(count = n())

Which will give you this output:

> student_summary
# A tibble: 18 × 5
# Groups:   student_id, field, subject [18]
   student_id field      subject grade2 count
        <dbl> <chr>      <chr>   <chr>  <int>
 1          1 Humanities english A          1
 2          1 Humanities history D          1
 3          1 STEM       biology B          1
 4          1 STEM       math    C          1
 5          2 Humanities english F          1
 6          2 Humanities history B          1
 7          2 STEM       biology D          1
 8          2 STEM       math    C          1
 9          3 Humanities english B          1
10          3 Humanities history A          1
11          3 STEM       biology D          1
12          3 STEM       math    A          1
13          4 Humanities english A          1
14          4 Humanities history C          1
15          4 STEM       biology A          1
16          4 STEM       math    B          1
17          5 STEM       biology D          1
18          5 NA         NA      C          1