R - How to subtract serial dates based on values in another column using R-CodePudding

I want to subtract serial dates that are associated with the same value in another column. I am trying to find the time between order dates for each customer. Say, I have a dataframe:

customerid <- c("A1", "A1", "A2", "A2", "A3", "A3", "A3", "A4")
orderdate <- c("2018-09-14", "2020-08-20", "2018-09-15", "2019-08-25", "2018-09-16", "2020-08-21","2017-09-12", "2018-08-10")
df <- data.frame(customerid, orderdate)

My desired output is creation of a new column called time_lapse:

customerid	orderdate	time_lapse
A1	2018-09-14	706
A1	2020-08-20
A2	2018-09-15	1
A2	2018-09-16
A3	2017-09-12	712
A3	2019-08-25	362
A3	2020-08-21
A4	2018-08-10

So far, I have this code but I am having trouble with how to proceed from here. Would a loop be reasonable here? But I also have a large dataset >50,000 customer ids. Thanks!

    time_lapse <- df[df$customerid %in% df$customerid[duplicated(df$customerid)],] #Subset of only customerids that have >1 occurrence
    time_lapse <- time_lapse %>% 
         group_by(customerid) %>% 
         mutate(number_occurrences = 1:n())  #counts no. times a customerid repeats

CodePudding user response：

You can try,

library(dplyr)

df %>% 
 group_by(customerid) %>% 
 filter(n() > 1) %>%               # filter groups with more than 1 rows
 mutate(orderdate = as.POSIXct(orderdate), 
        time_lapse = as.numeric(orderdate - lag(orderdate)))

# A tibble: 7 × 3
# Groups:   customerid [3]
  customerid orderdate           time_lapse
  <chr>      <dttm>                   <dbl>
1 A1         2018-09-14 00:00:00         NA
2 A1         2020-08-20 00:00:00        706
3 A2         2018-09-15 00:00:00         NA
4 A2         2019-08-25 00:00:00        344
5 A3         2018-09-16 00:00:00         NA
6 A3         2020-08-21 00:00:00        705
7 A3         2017-09-12 00:00:00      -1074

CodePudding user response：

We could use lead after arrange-ing the data:

library(dplyr)

df |>
  mutate(orderdate = as.Date(orderdate)) |>
  group_by(customerid) |>
  arrange(customerid, orderdate) |>
  mutate(time_lapse = lead(orderdate) - orderdate) |>
  ungroup()

Output:

# A tibble: 8 × 3
  customerid orderdate  time_lapse
  <chr>      <date>     <drtn>    
1 A1         2018-09-14 706 days  
2 A1         2020-08-20  NA days  
3 A2         2018-09-15 344 days  
4 A2         2019-08-25  NA days  
5 A3         2017-09-12 369 days  
6 A3         2018-09-16 705 days  
7 A3         2020-08-21  NA days  
8 A4         2018-08-10  NA days

Beware that the data in the table and in the code doesn't correspond.

CodePudding user response：

Considering you want several things, both time difference and count, I combined it all into one.

dat <- df %>% 
  mutate(orderdate = as.Date(orderdate)) %>% 
  group_by(customerid) %>% 
  filter(n() > 1) %>% 
  arrange(customerid, orderdate) %>% 
  mutate(elapsed_time = orderdate - first(orderdate),
         number_occurrences = 1:n())

  customerid orderdate  elapsed_time number_occurrences
  <chr>      <date>     <drtn>                    <int>
1 A1         2018-09-14    0 days                     1
2 A1         2020-08-20  706 days                     2
3 A2         2018-09-15    0 days                     1
4 A2         2019-08-25  344 days                     2
5 A3         2017-09-12    0 days                     1
6 A3         2018-09-16  369 days                     2
7 A3         2020-08-21 1074 days                     3

CodePudding user response：

Here is another way:

library(dplyr)
df <- df %>%
group_by(customerid) %>%
filter(n()>1) %>%
arrange(customerid, orderdate) %>%
mutate(time_lapse = as.numeric(c(diff.Date(as.Date(orderdate)),NA_real_)))

The output is:

  customerid orderdate  time_lapse
  <chr>      <chr>           <dbl>
  A1         2018-09-14        706
  A1         2020-08-20         NA
  A2         2018-09-15        344
  A2         2019-08-25         NA
  A3         2017-09-12        369
  A3         2018-09-16        705
  A3         2020-08-21         NA