Dplyr - add column that searches for difference of <30 days between two date columns in different-CodePudding

I have a dataset that I can reduce to three columns - CustomerID, EnterDate, ReturnDate. I would like to add a 3rd column which states whether or not, if a CustomerID appears more than once in the dataset, the second 'EnterDate' is within 30 days of the first 'ExitDate' (and the third is within 30 days of the second etc. if there are multiple entries for a single Customer ID).

So to turn a table like this:

CustomerID	EnterDate	ExitDate
1	14/09/2021	15/09/2021
1	03/10/2021	11/10/2021
2	03/10/2021	01/10/2021
2	17/10/2021	11/11/2021
3	03/10/2021	11/10/2021
3	30/12/2021	31/12/2021
4	03/10/2021	09/07/2022

In to this - an entry of '1' is entered in new column 'ResaleWithin30' if CustomerID matches and 'EnterDate' is within 30 days of previous 'ExitDate'.

CustomerID	EnterDate	ExitDate	ResaleWithin30
1	14/09/2021	15/09/2021	0
1	03/10/2021	11/10/2021	1
2	03/10/2021	01/10/2021	0
2	17/10/2021	11/11/2021	1
3	03/10/2021	11/10/2021	0
3	30/12/2021	31/12/2021	0
4	03/10/2021	09/07/2022	0

The below code works for comparing just EnterDate to previous EnterDate but I'd like essentially EnterDate to compare to ExitDate. I assume I need to amend the mutate statement to apply to both EnterDate and ExitDate and then change the lag to compare EnterDate to ExitDate however I am getting in to various errors trying to get this completed so any amendment/help would be very much appreciated. Thank you!

library(dplyr)

df %>% 
  group_by(CustomerID) %>% 
  mutate(EnterDate = as.Date(EnterDate, tryFormats = "%d/%m/%Y"),
         ResaleWithin30 = as.integer(EnterDate - lag(EnterDate) <= 30),
         ResaleWithin30 = replace_na(ResaleWithin30, 0))

CodePudding user response：

I'm afraid I don't see what the problem is. If I've missed something, please explain.

You can mutate more than one column in a single call. Just sepatate the mutations with commas. You can even mutate the same column more than once, as you have done yourself. Or you can chain seveal calls to mutate in the same pipe.

Create some data

library(tidyverse)
# I prefer the consistency and additional functionality of lubridate to base R.
# Base R will suffice in this case.
library(lubridate)

df <- 
read.table(textConnection("CustomerID   EnterDate   ExitDate
1 14/09/2021 15/09/2021
1 03/10/2021 11/10/2021
2 03/10/2021 01/10/2021
2 17/10/2021 11/11/2021
3 03/10/2021 11/10/2021
3 30/12/2021 31/12/2021
4 03/10/2021 09/07/2022"), header=TRUE)

Solve the problem.

df %>% 
  group_by(CustomerID) %>% 
  mutate(
    EnterDate=dmy(EnterDate), 
    ExitDate=dmy(ExitDate),
    ResaleWithin30 = ifelse(EnterDate - lag(ExitDate) <= 30, 1, 0),
    ResaleWithin30 = replace_na(ResaleWithin30, 0)
  ) %>% 
  ungroup()
# A tibble: 7 × 4
  CustomerID EnterDate  ExitDate   ResaleWithin30
       <int> <date>     <date>              <dbl>
1          1 2021-09-14 2021-09-15              0
2          1 2021-10-03 2021-10-11              1
3          2 2021-10-03 2021-10-01              0
4          2 2021-10-17 2021-11-11              1
5          3 2021-10-03 2021-10-11              0
6          3 2021-12-30 2021-12-31              0
7          4 2021-10-03 2022-07-09              0

Given your test data, the call to replace_na() is redundant.

CodePudding user response：

I think the below might work.

df %>% 
  group_by(CustomerID) %>% 
  mutate(EnterDate = as.Date(EnterDate, tryFormats = "%d/%m/%Y"),
         ResaleWithin30 = as.integer(EnterDate - lag(ExitDate) <= 30),
         ResaleWithin30 = replace_na(ResaleWithin30, 0))

It is easier to help if you provide a reprex