I have a data set with student names, their student ID, and the date they took an exam taken. It looks a bit like this:
Student ID | Student Name | Date Taken |
---|---|---|
01 | Christian | 2009-09-24 |
01 | Christian | 2015-06-16 |
01 | Christian | 2014-05-14 |
12 | Laura | 2017-01-04 |
12 | Laura | 2009-05-12 |
43 | Alex | 2014-05-07 |
43 | Alex | 2009-05-01 |
43 | Alex | 2009-08-24 |
43 | Alex | 2013-04-29 |
06 | Sarah | 2018-05-03 |
06 | Sarah | 2015-06-24 |
I want to create a variable that accounts for the cumulative number of exams taken by each student, without counting the first exam they have taken (based on the date they took it).
I tried doing this by:
count(Student ID, "01")
count(Student ID, "02")
etc.
But I feel that this is not the most efficient way to do it and it also doesn't give me a good summary of the data by student name. I want to use student ID vs student name, because some names repeat, yet their student ID is different. I'm unsure how to go about this code, or if I am in the right track.
I was thinking as well of making an extra column where each exam value is counted as 1, and first projects launched as 0. And then adding the numbers per ID. But I am also unsure on how to go about the code.
Any suggestions? Thank you a million beforehand!
CodePudding user response:
If you want to add a new column with the cumulative sum, then we can group by id and arrange by date, then create 0 for the first row and 1 for the other rows, then do cumsum
.
library(tidyverse)
df %>%
group_by(`Student ID`) %>%
arrange(`Student ID`, `Date Taken`) %>%
mutate(exams = ifelse(row_number()==1, 0, 1),
exams = cumsum(exams))
Output
Student.ID Student.Name Date.Taken exams
<int> <chr> <chr> <dbl>
1 1 Christian 2009-09-24 0
2 1 Christian 2014-05-14 1
3 1 Christian 2015-06-16 2
4 6 Sarah 2015-06-24 0
5 6 Sarah 2018-05-03 1
6 12 Laura 2009-05-12 0
7 12 Laura 2017-01-04 1
8 43 Alex 2009-05-01 0
9 43 Alex 2009-08-24 1
10 43 Alex 2013-04-29 2
11 43 Alex 2014-05-07 3
Data
df <- structure(list(`Student ID` = c(1L, 1L, 1L, 12L, 12L, 43L, 43L,
43L, 43L, 6L, 6L), `Student Name` = c("Christian", "Christian",
"Christian", "Laura", "Laura", "Alex", "Alex", "Alex", "Alex",
"Sarah", "Sarah"), `Date Taken` = c("2009-09-24", "2015-06-16",
"2014-05-14", "2017-01-04", "2009-05-12", "2014-05-07", "2009-05-01",
"2009-08-24", "2013-04-29", "2018-05-03", "2015-06-24")), class = "data.frame", row.names = c(NA,
-11L))
CodePudding user response:
If your dataframe as above is d
, then you can use the dplyr
package, and group by Student ID
library(dplyr)
d %>%
group_by(`Student ID`, `Student Name`) %>%
summarize(exams = n()-1)
Output:
`Student ID` `Student Name` exams
<int> <chr> <dbl>
1 1 Christian 2
2 6 Sarah 1
3 12 Laura 1
4 43 Alex 3
If you want each row retained, and the count of exams, starting at 0, you can do this:
d %>%
arrange(`Date Taken`) %>%
group_by(`Student ID`) %>%
mutate(exams = row_number()-1)