I'm trying to summarize a data set based on the time between, within a group. Basically what I have is, like this example, a data set where students buy a course, the courses have levels : 1 till 5, every course is about 2 months and students can just follow the courses from 1 till 5. I want to analyze when students:
- drop out. So that would be, when they have not bought a new course after 2 months
- What the last course was, after they have not bought a course in 2 months
- and if they, after they stopped later decided to continue again
and example of my data set is this:
> data
Student Buy_Date Course
A 01-01-2019 Level 1
A 01-03-2019 Level 2
A 01-05-2019 Level 3
A 01-07-2019 Level 4
B 01-01-2019 Level 1
B 01-03-2019 Level 2
B 01-09-2019 Level 3
What I would like to have as a result is:
> Data_Result
Student First_Buy First_Course Last_Course Stopped_Date Continued? Course_Continued
A 01-01-2019 Level 1 Level 4 01-07-2019 No NA
B 01-01-2019 Level 1 Level 2 01-03-2019 Yes Level 3
I have tried to search for options like this but I'm not getting further then: ordering the row within group based on time and calculate the time between the previous. But I get stuck after that.. Could someone help me?
CodePudding user response:
Here's a way to get the expected output. The key thing here is to compute the number of months between two dates (Diff_Months
). After that, each variables can be summarised.
library(tidyverse)
library(lubridate)
df %>%
group_by(Student) %>%
mutate(Buy_Date = dmy(Buy_Date),
Diff_Months = abs((interval(Buy_Date, lag(Buy_Date, default = first(Buy_Date))) %/% months(1)))) %>%
summarise(First_Buy = first(Buy_Date),
First_Course = first(Course),
Last_Course = last(Course[Diff_Months <= 2]),
Stopped_Date = Buy_Date[Course == Last_Course],
Continued = ifelse(Last_Course == last(Course), "No", "Yes"),
Course_Continued = ifelse(Continued == "Yes", last(Course), NA))
# A tibble: 2 x 7
Student First_Buy First_Course Last_Course Stopped_Date Continued Course_Continued
<chr> <date> <chr> <chr> <date> <chr> <chr>
1 A 2019-01-01 Level 1 Level 4 2019-07-01 No NA
2 B 2019-01-01 Level 1 Level 2 2019-03-01 Yes Level 3
data
df <- read.table(header = T, text = "Student Buy_Date Course
A 01-01-2019 'Level 1'
A 01-03-2019 'Level 2'
A 01-05-2019 'Level 3'
A 01-07-2019 'Level 4'
B 01-01-2019 'Level 1'
B 01-03-2019 'Level 2'
B 01-09-2019 'Level 3'")