Home > Software engineering >  Summarize based on time between
Summarize based on time between

Time:04-05

I'm trying to summarize a data set based on the time between, within a group. Basically what I have is, like this example, a data set where students buy a course, the courses have levels : 1 till 5, every course is about 2 months and students can just follow the courses from 1 till 5. I want to analyze when students:

  • drop out. So that would be, when they have not bought a new course after 2 months
  • What the last course was, after they have not bought a course in 2 months
  • and if they, after they stopped later decided to continue again

and example of my data set is this:

> data

Student     Buy_Date      Course
A           01-01-2019    Level 1
A           01-03-2019    Level 2
A           01-05-2019    Level 3
A           01-07-2019    Level 4
B           01-01-2019    Level 1
B           01-03-2019    Level 2
B           01-09-2019    Level 3

What I would like to have as a result is:

> Data_Result
    Student     First_Buy    First_Course   Last_Course   Stopped_Date   Continued?     Course_Continued      
    A           01-01-2019   Level 1        Level 4       01-07-2019     No             NA
    B           01-01-2019   Level 1        Level 2       01-03-2019     Yes            Level 3

I have tried to search for options like this but I'm not getting further then: ordering the row within group based on time and calculate the time between the previous. But I get stuck after that.. Could someone help me?

CodePudding user response:

Here's a way to get the expected output. The key thing here is to compute the number of months between two dates (Diff_Months). After that, each variables can be summarised.

library(tidyverse)
library(lubridate)

df %>% 
  group_by(Student) %>% 
  mutate(Buy_Date = dmy(Buy_Date),
         Diff_Months = abs((interval(Buy_Date, lag(Buy_Date, default = first(Buy_Date))) %/% months(1)))) %>% 
  summarise(First_Buy = first(Buy_Date),
            First_Course = first(Course),
            Last_Course = last(Course[Diff_Months <= 2]),
            Stopped_Date = Buy_Date[Course == Last_Course],
            Continued = ifelse(Last_Course == last(Course), "No", "Yes"),
            Course_Continued = ifelse(Continued == "Yes", last(Course), NA))

# A tibble: 2 x 7
  Student First_Buy  First_Course Last_Course Stopped_Date Continued Course_Continued
  <chr>   <date>     <chr>        <chr>       <date>       <chr>     <chr>           
1 A       2019-01-01 Level 1      Level 4     2019-07-01   No        NA              
2 B       2019-01-01 Level 1      Level 2     2019-03-01   Yes       Level 3         

data

df <- read.table(header = T, text = "Student     Buy_Date      Course
A           01-01-2019    'Level 1'
A           01-03-2019    'Level 2'
A           01-05-2019    'Level 3'
A           01-07-2019    'Level 4'
B           01-01-2019    'Level 1'
B           01-03-2019    'Level 2'
B           01-09-2019    'Level 3'")
  • Related