Home > Software design >  How do I count the amount of a variable?
How do I count the amount of a variable?

Time:03-31

I have a data set with student names, their student ID, and the date they took an exam taken. It looks a bit like this:

Student ID Student Name Date Taken
01 Christian 2009-09-24
01 Christian 2015-06-16
01 Christian 2014-05-14
12 Laura 2017-01-04
12 Laura 2009-05-12
43 Alex 2014-05-07
43 Alex 2009-05-01
43 Alex 2009-08-24
43 Alex 2013-04-29
06 Sarah 2018-05-03
06 Sarah 2015-06-24

I want to create a variable that accounts for the cumulative number of exams taken by each student, without counting the first exam they have taken (based on the date they took it).

I tried doing this by:

count(Student ID, "01")
count(Student ID, "02")

etc.

But I feel that this is not the most efficient way to do it and it also doesn't give me a good summary of the data by student name. I want to use student ID vs student name, because some names repeat, yet their student ID is different. I'm unsure how to go about this code, or if I am in the right track.

I was thinking as well of making an extra column where each exam value is counted as 1, and first projects launched as 0. And then adding the numbers per ID. But I am also unsure on how to go about the code.

Any suggestions? Thank you a million beforehand!

CodePudding user response:

If you want to add a new column with the cumulative sum, then we can group by id and arrange by date, then create 0 for the first row and 1 for the other rows, then do cumsum.

library(tidyverse)

df %>%
  group_by(`Student ID`) %>%
  arrange(`Student ID`, `Date Taken`) %>%
  mutate(exams = ifelse(row_number()==1, 0, 1),
         exams = cumsum(exams))

Output

   Student.ID Student.Name Date.Taken exams
        <int> <chr>        <chr>      <dbl>
 1          1 Christian    2009-09-24     0
 2          1 Christian    2014-05-14     1
 3          1 Christian    2015-06-16     2
 4          6 Sarah        2015-06-24     0
 5          6 Sarah        2018-05-03     1
 6         12 Laura        2009-05-12     0
 7         12 Laura        2017-01-04     1
 8         43 Alex         2009-05-01     0
 9         43 Alex         2009-08-24     1
10         43 Alex         2013-04-29     2
11         43 Alex         2014-05-07     3

Data

df <- structure(list(`Student ID` = c(1L, 1L, 1L, 12L, 12L, 43L, 43L, 
43L, 43L, 6L, 6L), `Student Name` = c("Christian", "Christian", 
"Christian", "Laura", "Laura", "Alex", "Alex", "Alex", "Alex", 
"Sarah", "Sarah"), `Date Taken` = c("2009-09-24", "2015-06-16", 
"2014-05-14", "2017-01-04", "2009-05-12", "2014-05-07", "2009-05-01", 
"2009-08-24", "2013-04-29", "2018-05-03", "2015-06-24")), class = "data.frame", row.names = c(NA, 
-11L))

CodePudding user response:

If your dataframe as above is d, then you can use the dplyr package, and group by Student ID

library(dplyr)

d %>% 
  group_by(`Student ID`, `Student Name`) %>% 
  summarize(exams = n()-1)

Output:

  `Student ID` `Student Name` exams
         <int> <chr>          <dbl>
1            1 Christian          2
2            6 Sarah              1
3           12 Laura              1
4           43 Alex               3

If you want each row retained, and the count of exams, starting at 0, you can do this:

d %>% 
  arrange(`Date Taken`) %>% 
  group_by(`Student ID`) %>% 
  mutate(exams = row_number()-1)
  • Related