I want to subtract serial dates that are associated with the same value in another column. I am trying to find the time between order dates for each customer. Say, I have a dataframe:
customerid <- c("A1", "A1", "A2", "A2", "A3", "A3", "A3", "A4")
orderdate <- c("2018-09-14", "2020-08-20", "2018-09-15", "2019-08-25", "2018-09-16", "2020-08-21","2017-09-12", "2018-08-10")
df <- data.frame(customerid, orderdate)
My desired output is creation of a new column called time_lapse:
customerid | orderdate | time_lapse |
---|---|---|
A1 | 2018-09-14 | 706 |
A1 | 2020-08-20 | |
A2 | 2018-09-15 | 1 |
A2 | 2018-09-16 | |
A3 | 2017-09-12 | 712 |
A3 | 2019-08-25 | 362 |
A3 | 2020-08-21 | |
A4 | 2018-08-10 |
So far, I have this code but I am having trouble with how to proceed from here. Would a loop be reasonable here? But I also have a large dataset >50,000 customer ids. Thanks!
time_lapse <- df[df$customerid %in% df$customerid[duplicated(df$customerid)],] #Subset of only customerids that have >1 occurrence
time_lapse <- time_lapse %>%
group_by(customerid) %>%
mutate(number_occurrences = 1:n()) #counts no. times a customerid repeats
CodePudding user response:
You can try,
library(dplyr)
df %>%
group_by(customerid) %>%
filter(n() > 1) %>% # filter groups with more than 1 rows
mutate(orderdate = as.POSIXct(orderdate),
time_lapse = as.numeric(orderdate - lag(orderdate)))
# A tibble: 7 × 3
# Groups: customerid [3]
customerid orderdate time_lapse
<chr> <dttm> <dbl>
1 A1 2018-09-14 00:00:00 NA
2 A1 2020-08-20 00:00:00 706
3 A2 2018-09-15 00:00:00 NA
4 A2 2019-08-25 00:00:00 344
5 A3 2018-09-16 00:00:00 NA
6 A3 2020-08-21 00:00:00 705
7 A3 2017-09-12 00:00:00 -1074
CodePudding user response:
We could use lead
after arrange
-ing the data:
library(dplyr)
df |>
mutate(orderdate = as.Date(orderdate)) |>
group_by(customerid) |>
arrange(customerid, orderdate) |>
mutate(time_lapse = lead(orderdate) - orderdate) |>
ungroup()
Output:
# A tibble: 8 × 3
customerid orderdate time_lapse
<chr> <date> <drtn>
1 A1 2018-09-14 706 days
2 A1 2020-08-20 NA days
3 A2 2018-09-15 344 days
4 A2 2019-08-25 NA days
5 A3 2017-09-12 369 days
6 A3 2018-09-16 705 days
7 A3 2020-08-21 NA days
8 A4 2018-08-10 NA days
Beware that the data in the table and in the code doesn't correspond.
CodePudding user response:
Considering you want several things, both time difference and count, I combined it all into one.
dat <- df %>%
mutate(orderdate = as.Date(orderdate)) %>%
group_by(customerid) %>%
filter(n() > 1) %>%
arrange(customerid, orderdate) %>%
mutate(elapsed_time = orderdate - first(orderdate),
number_occurrences = 1:n())
customerid orderdate elapsed_time number_occurrences
<chr> <date> <drtn> <int>
1 A1 2018-09-14 0 days 1
2 A1 2020-08-20 706 days 2
3 A2 2018-09-15 0 days 1
4 A2 2019-08-25 344 days 2
5 A3 2017-09-12 0 days 1
6 A3 2018-09-16 369 days 2
7 A3 2020-08-21 1074 days 3
CodePudding user response:
Here is another way:
library(dplyr)
df <- df %>%
group_by(customerid) %>%
filter(n()>1) %>%
arrange(customerid, orderdate) %>%
mutate(time_lapse = as.numeric(c(diff.Date(as.Date(orderdate)),NA_real_)))
The output is:
customerid orderdate time_lapse
<chr> <chr> <dbl>
A1 2018-09-14 706
A1 2020-08-20 NA
A2 2018-09-15 344
A2 2019-08-25 NA
A3 2017-09-12 369
A3 2018-09-16 705
A3 2020-08-21 NA