How to calculate avg response time & total response time based on group

I have a table that looks like so (1 example - total of 2 million rows):

tweet_id |  id         |  group   |                created_at         | tweet                                                 |  response_tweet_id
1           sprintcare    Support      Tue Oct 31 22:10:47  0000 2017   @115712 I understand. I would like to assist you.        2
2           115712        Customer     Tue Oct 31 22:11:45  0000 2017   @sprintcare and how do you propose we do that?           NA
3           115712        Customer     Tue Oct 31 22:08:27  0000 2017   @sprintcare I have sent several private messages.        1
4           sprintcare    Support      Tue Oct 31 21:54:49  0000 2017   @115712 Please send us a Private Message.                3
5           115712        Customer     Tue Oct 31 21:49:35  0000 2017   @sprintcare I did.                                       4
6           sprintcare    Support      Tue Oct 31 21:46:24  0000 2017   @115712 Can you please send us a private message.        5,7
8           115712        Customer     Tue Oct 31 21:45:10  0000 2017   @sprintcare is the worst customer service                9,6,10
11          apple         Support      Tue Oct 31 22:10:35  0000 2017  @115713 This is saddening to hear.                        NA
12          115713        Customer     Tue Oct 31 22:04:47  0000 2017   @apple My phone is not working. Help!               11

When referencing the first row, we can see the time the first support tweet response came in when responding to tweet_id 1 (this is labeled as 2).

Ideally - I would like to calculate how long it takes for each response tweet to come in - ordering from support to customer.

I would like to calculate two values:

the response time between the first tweet made to Support (tweet_id, response_tweet_id, and created_at). For the first row - the time difference would be: Tue Oct 31 22:10:47 0000 2017 - Tue Oct 31 22:08:27 0000 2017 = 00:02:20.
Total response time between first tweet from support to last tweet given each customer. In the example below - it would essentially be the first tweet made relative to time to support and the last tweet made before next unique id comes into play.

Desired output would look like so based on each group (100 unique companies):

id         | group  | Avg_response_time_per_tweet (in minutes) | Total_avg_response_time (in minutes)
sprintcare   Support  ####                                       ####
apple        Support  ####                                       ####

CodePudding user response：

library(lubridate)
library(dplyr)
library(tidyr)
df %>% 
    separate(created_at, c("Day_name", "Month", "Day", "Hour", "Minute", "Second", "X", "Year")) %>% 
    type.convert(as.is = TRUE) %>% 
    mutate(Month = match(Month, month.abb)) %>% 
    mutate(created_at = make_datetime(Year, Month, Day, Hour, Minute, Second), .keep="unused") %>% 
    group_by(id, group) %>% 
    summarise(Avg_response_time = mean(difftime(max(created_at), min(created_at))))

  id         group    Avg_response_time
  <chr>      <chr>    <drtn>           
1 115712     Customer 26.58333 mins    
2 sprintcare Support  24.38333 mins

CodePudding user response：

It's not a really performant approach, but it should yield the desired results.

library(dplyr)
library(stringr)

dat %>% 
  mutate(id_at = str_extract(tweet, "\\d "),
         global_id = ifelse(is.na(id_at), id, id_at)) %>% 
  group_by(global_id) %>% 
  group_modify(~ .x %>%
                 rowwise %>% 
                 mutate(response_time = if(group == "Support") {
                   filter(.x, created_at < .env$created_at, group == "Customer") %>% 
                     slice_max(created_at, n = 1) %>% pull(created_at)
                 } else NA)
  ) %>% 
  summarise(id = first(id), Total_avg_response_time = difftime(max(created_at), min(created_at)),
            Avg_response_time_per_tweet = mean(difftime(created_at, response_time), na.rm = TRUE))

#> # A tibble: 1 x 4
#>   global_id id         Total_avg_response_time Avg_response_time_per_tweet
#>   <chr>     <chr>      <drtn>                  <drtn>                     
#> 1 115712    sprintcare 26.58333 mins           2.933333 mins

# data used
dat <- read.table(text = "tweet_id   id             group                  created_at            tweet
                  1           sprintcare    Support      'Tue Oct 31 22:10:47  0000 2017'   '@115712 I understand. I would like to assist you.'
                  2           115712        Customer     'Tue Oct 31 22:11:45  0000 2017'   '@sprintcare and how do you propose we do that'
                  3           115712        Customer     'Tue Oct 31 22:08:27  0000 2017'   '@sprintcare I have sent several private messages and no one is responding as usual'
                  4           sprintcare    Support      'Tue Oct 31 21:54:49  0000 2017'   '@115712 Please send us a Private Message so that we can further assist you.'
                  5           115712        Customer     'Tue Oct 31 21:49:35  0000 2017'   '@sprintcare I did.'
                  6           sprintcare    Support      'Tue Oct 31 21:46:24  0000 2017'   '@115712 Can you please send us a private message, so that I can gain further details about your account?'
                  7           115712        Customer     'Tue Oct 31 21:45:10  0000 2017'   '@sprintcare is the worst customer service'",
                  header = TRUE)

dat <- dat %>% 
  mutate(created_at = strptime(created_at, "%a %b %d %H:%M:%S %z %Y"))

^{Created on 2021-09-15 by the reprex package (v0.3.0)}