I have a dataset that I can reduce to two columns - firstly, CustomerID, secondly, SaleDate. I have already taken steps to ensure that this dataset includes only customers with at least two sales. I would like to add a 3rd column which states whether or not the second sale (or 3rd sale compared to the second etc.) was within 30 days of the previous sale.
CustomerID | SaleDate |
---|---|
1 | 14/09/2021 |
1 | 30/09/2021 |
1 | 03/10/2021 |
2 | 23/07/2021 |
2 | 24/07/2021 |
With the new column entitled 'SaleDate' adding a '1' if the resale is within 30 days of the previous sale and a '0' if not (e.g. it's the first sale or not within 30 of the previous sale). So the end result in this case would be:
CustomerID | SaleDate | ResaleWithin30 |
---|---|---|
1 | 14/09/2021 | 0 |
1 | 30/09/2021 | 1 |
1 | 03/10/2021 | 1 |
2 | 23/07/2021 | 0 |
2 | 24/07/2021 | 1 |
Any advice on this would be very much appreciated, thank you.
CodePudding user response:
You can use lag
to check whether the next record is within 30 days of the present record.
library(dplyr)
df %>%
group_by(CustomerID) %>%
mutate(SaleDate = as.Date(SaleDate, tryFormats = "%d/%m/%Y"),
ResaleWithin30 = as.integer(SaleDate - lag(SaleDate) <= 30),
ResaleWithin30 = replace_na(ResaleWithin30, 0))
# A tibble: 5 × 3
# Groups: CustomerID [2]
CustomerID SaleDate ResaleWithin30
<int> <date> <int>
1 1 2021-09-14 0
2 1 2021-09-30 1
3 1 2021-10-03 1
4 2 2021-07-23 0
5 2 2021-07-24 1