R - summarize by name based on (follow up) date-CodePudding

I'm trying to create a dataset that shows how students move through courses. I have a dataset like this:

   InvoiceDate StudentName        Course                    
   <date>      <fct>             <fct>                                     
 1 2020-07-26  Tom                Level 1
 2 2020-11-05  Tom                Level 2    
 3 2021-11-05  Tom                Level 3    
 4 2018-10-15  Mary               Level 1                                      
 5 2020-08-06  Mary               Level 2                                    
 6 2021-10-10  Mary               Level 2

What I would like to know is which course is followed after a student does a certain level, and also when the student did not do any follow-up courses. The dataset I would like to create is this:

   FullName  StartCourseDate   StartCourse    FollowUpCourse   FollowUpCourseDate        
 1 Tom       2020-07-26        Level 1        Level 2          2020-11-05                   
 2 Tom       2020-11-05        Level 2        Level 3          2021-11-05
 2 Tom       2021-11-05        Level 3        Stop             Stop
 3 Mary      2018-10-15        Level 1        Level 2          2020-08-06
 4 Mary      2020-08-06        Level 2        Level 2          2021-10-10
 4 Mary      2021-10-10        Level 2        Stop             Stop

I have tried different things tidyverse / dplyr but I can't get the rows in the right order. Hopefully someone can help :)

CodePudding user response：

First arrange by date (make sure it's class date). Then add the leading data followed by renameing and relocateing the columns.

library(dplyr)

df %>% 
  group_by(StudentName) %>% 
  arrange(InvoiceDate, .by_group = TRUE) %>% 
  mutate(FollowUpCourse = lead(Course, default = "Stop"), 
    FollowUpCourseDate = lead(as.character(InvoiceDate), default = "Stop") ) %>% 
  rename(StartCourseDate = InvoiceDate, FullName = StudentName, StartCourse = Course) %>% 
  relocate(FullName, .before = StartCourseDate) %>%
  ungroup()
# A tibble: 6 × 5
  FullName StartCourseDate StartCourse FollowUpCourse FollowUpCourseDate
  <chr>    <date>          <chr>       <chr>          <chr>             
1 Mary     2018-10-15      Level 1     Level 2        2020-08-06        
2 Mary     2020-08-06      Level 2     Level 2        2021-10-10        
3 Mary     2021-10-10      Level 2     Stop           Stop              
4 Tom      2020-07-26      Level 1     Level 2        2020-11-05        
5 Tom      2020-11-05      Level 2     Level 3        2021-11-05        
6 Tom      2021-11-05      Level 3     Stop           Stop

Data

df <- structure(list(InvoiceDate = structure(c(18469, 18571, 18936, 
17819, 18480, 18910), class = "Date"), StudentName = c("Tom", 
"Tom", "Tom", "Mary", "Mary", "Mary"), Course = c("Level 1", 
"Level 2", "Level 3", "Level 1", "Level 2", "Level 2")), row.names = c(NA, 
-6L), class = "data.frame")

CodePudding user response：

You can get your outcome by mutating and using lead, using across allows you to condense your mutate calls.

library(dplyr)

df |> 
  mutate(Name = StudentName, StartCourseDate = InvoiceDate,
         StartCourse = Course, across(c( Course, InvoiceDate), ~ 
    ifelse(Course != "Level 3", lead(as.character(.x), default = "Stop"), "Stop"), 
                                      .names = "FollowUp{.col}"),  .keep = "unused")

  Name StartCourseDate StartCourse FollowUpCourse FollowUpInvoiceDate
1  Tom      2020-07-26     Level 1        Level 2          2020-11-05
2  Tom      2020-11-05     Level 2        Level 3          2021-11-05
3  Tom      2021-11-05     Level 3           Stop                Stop
4 Mary      2018-10-15     Level 1        Level 2          2020-08-06
5 Mary      2020-08-06     Level 2        Level 2          2021-10-10
6 Mary      2021-10-10     Level 2           Stop                Stop