I'm trying to create a dataset that shows how students move through courses. I have a dataset like this:
InvoiceDate StudentName Course
<date> <fct> <fct>
1 2020-07-26 Tom Level 1
2 2020-11-05 Tom Level 2
3 2021-11-05 Tom Level 3
4 2018-10-15 Mary Level 1
5 2020-08-06 Mary Level 2
6 2021-10-10 Mary Level 2
What I would like to know is which course is followed after a student does a certain level, and also when the student did not do any follow-up courses. The dataset I would like to create is this:
FullName StartCourseDate StartCourse FollowUpCourse FollowUpCourseDate
1 Tom 2020-07-26 Level 1 Level 2 2020-11-05
2 Tom 2020-11-05 Level 2 Level 3 2021-11-05
2 Tom 2021-11-05 Level 3 Stop Stop
3 Mary 2018-10-15 Level 1 Level 2 2020-08-06
4 Mary 2020-08-06 Level 2 Level 2 2021-10-10
4 Mary 2021-10-10 Level 2 Stop Stop
I have tried different things tidyverse
/ dplyr
but I can't get the rows in the right order. Hopefully someone can help :)
CodePudding user response:
First arrange
by date (make sure it's class date
).
Then add the leading data followed by rename
ing and relocate
ing the columns.
library(dplyr)
df %>%
group_by(StudentName) %>%
arrange(InvoiceDate, .by_group = TRUE) %>%
mutate(FollowUpCourse = lead(Course, default = "Stop"),
FollowUpCourseDate = lead(as.character(InvoiceDate), default = "Stop") ) %>%
rename(StartCourseDate = InvoiceDate, FullName = StudentName, StartCourse = Course) %>%
relocate(FullName, .before = StartCourseDate) %>%
ungroup()
# A tibble: 6 × 5
FullName StartCourseDate StartCourse FollowUpCourse FollowUpCourseDate
<chr> <date> <chr> <chr> <chr>
1 Mary 2018-10-15 Level 1 Level 2 2020-08-06
2 Mary 2020-08-06 Level 2 Level 2 2021-10-10
3 Mary 2021-10-10 Level 2 Stop Stop
4 Tom 2020-07-26 Level 1 Level 2 2020-11-05
5 Tom 2020-11-05 Level 2 Level 3 2021-11-05
6 Tom 2021-11-05 Level 3 Stop Stop
Data
df <- structure(list(InvoiceDate = structure(c(18469, 18571, 18936,
17819, 18480, 18910), class = "Date"), StudentName = c("Tom",
"Tom", "Tom", "Mary", "Mary", "Mary"), Course = c("Level 1",
"Level 2", "Level 3", "Level 1", "Level 2", "Level 2")), row.names = c(NA,
-6L), class = "data.frame")
CodePudding user response:
You can get your outcome by mutating
and using lead
,
using across
allows you to condense your mutate calls.
library(dplyr)
df |>
mutate(Name = StudentName, StartCourseDate = InvoiceDate,
StartCourse = Course, across(c( Course, InvoiceDate), ~
ifelse(Course != "Level 3", lead(as.character(.x), default = "Stop"), "Stop"),
.names = "FollowUp{.col}"), .keep = "unused")
Name StartCourseDate StartCourse FollowUpCourse FollowUpInvoiceDate
1 Tom 2020-07-26 Level 1 Level 2 2020-11-05
2 Tom 2020-11-05 Level 2 Level 3 2021-11-05
3 Tom 2021-11-05 Level 3 Stop Stop
4 Mary 2018-10-15 Level 1 Level 2 2020-08-06
5 Mary 2020-08-06 Level 2 Level 2 2021-10-10
6 Mary 2021-10-10 Level 2 Stop Stop