I have a dataset of students, and I want to create a column that categorizes each student based on their and other students' data in their "subject" and "grade_letter" column because I eventually want to create a column that says "math_science_A", "science_B", etc. to know how many students got As, Bs, Cs, etc. based on the unique make ups.
I have a list of all potential combinations that a student can have, but when I try to iterate over R, my case_when only gives me the first category. I then tried to make multiple columns, so I could pivot longer later, but that also seemed to fail. Any help would be appreciated!
#Here's my sample dataset
library(tidyverse)
test <- tibble(student_id = c(1),
subject = c("math"),
grade_num = c(95),
grade = c("a"))
#Here's a sample of what I want (full desired output would be too long to write; see text below code chunks)
sample_of_desired_output <- tibble(student_id = c(1),
subject = c("math"),
grade_num = c(95),
grade = c("a"),
category = c("n_math_a", "n_science_a", "n_english_a", "n_math_science_a", "n_math_english_a"))
#####Here's what I tried
#Getting all combinations--this part works
all_subjects <- c("math", "science", "english")
i = 1
while(i <= length(all_subjects)) {
if(i == 1) {
all_subject_combos <- c()
} else {
some_combos <- combn(all_subjects, i, FUN = function(x) paste(x, collapse = "_"), simplify = TRUE)
all_subject_combos <- append(all_subject_combos, some_combos)
}
i <- i 1
}
all_grades <- c("A", "B", "C", "D", "F")
all_subjects_and_grades <- expand.grid(all_subject_combos, all_grades)
subjects_and_grades_combos <- all_subjects_and_grades %>%
unite("names", c(Var1, Var2)) %>%
mutate(names = tolower(paste0("n_", str_replace_all(names, "\\|", "_")))) %>%
pull(names)
#This part does not work but is what I tried:
i = 1
while(i <= length(subjects_and_grades_combos)) {
does_not_work <- test %>%
mutate("subject_grade_grouping_{{i}}" := case_when(
str_detect(subjects_and_grades_combos[i], subject) ~ subjects_and_grades_combos[i]))
i <- i 1
}
does_not_work
Note that in the final, there will be all potential combinations, so each student will have their data repeating many times. After I have this, I will be using this to calculate the number of students who fall into each category. So, in this example, math_a, math_science_a, math_english_a, and math_science_english_a would all have 1. Note that I do not need help calculating that part, and I am flexible with the order of the columns.
CodePudding user response:
The while
loops needs to update the original object or else it gets the last iteration update
library(dplyr)
library(tidyr)
test1 <- test
i = 1
while(i <= length(subjects_and_grades_combos)) {
test1 <- test1 %>%
mutate("subject_grade_grouping_{{i}}" := case_when(
str_detect(subjects_and_grades_combos[i], subject) ~ subjects_and_grades_combos[i]))
i <- i 1
}
test1 %>%
pivot_longer(cols = starts_with('subject_grade'),
names_to = NULL, values_to = "category", values_drop_na = TRUE)
-output
# A tibble: 15 × 5
student_id subject grade_num grade category
<dbl> <chr> <dbl> <chr> <chr>
1 1 math 95 a n_math_science_a
2 1 math 95 a n_math_english_a
3 1 math 95 a n_math_science_english_a
4 1 math 95 a n_math_science_b
5 1 math 95 a n_math_english_b
6 1 math 95 a n_math_science_english_b
7 1 math 95 a n_math_science_c
8 1 math 95 a n_math_english_c
9 1 math 95 a n_math_science_english_c
10 1 math 95 a n_math_science_d
11 1 math 95 a n_math_english_d
12 1 math 95 a n_math_science_english_d
13 1 math 95 a n_math_science_f
14 1 math 95 a n_math_english_f
15 1 math 95 a n_math_science_english_f
CodePudding user response:
If you're just looking for how many students got each grade for each class:
library(dplyr); library(tidyr)
test %>%
count(subject, grade) %>%
complete(subject = c("math", "science", "english"),
grade = c("a", "b", "c", "d", "f"), fill = list(n=0)) %>%
pivot_wider(names_from = c(subject, grade), values_from = n)
# A tibble: 1 × 15
english_a english_b english_c english_d english_f math_a math_b math_c math_d math_f science_a scienc…¹ scien…² scien…³ scien…⁴
<int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
# … with abbreviated variable names ¹science_b, ²science_c, ³science_d, ⁴science_f
It sounds like you also want combinations of some sort, but it's not clear to me what you mean by that. Are we looking for the number of students who got an A in both math and science, or an A at least one of math and science? The sample data and output doesn't give a sense of what that should look like.