I have a table that looks like so (1 example - total of 2 million rows):
tweet_id | id | group | created_at | tweet | response_tweet_id
1 sprintcare Support Tue Oct 31 22:10:47 0000 2017 @115712 I understand. I would like to assist you. 2
2 115712 Customer Tue Oct 31 22:11:45 0000 2017 @sprintcare and how do you propose we do that? NA
3 115712 Customer Tue Oct 31 22:08:27 0000 2017 @sprintcare I have sent several private messages. 1
4 sprintcare Support Tue Oct 31 21:54:49 0000 2017 @115712 Please send us a Private Message. 3
5 115712 Customer Tue Oct 31 21:49:35 0000 2017 @sprintcare I did. 4
6 sprintcare Support Tue Oct 31 21:46:24 0000 2017 @115712 Can you please send us a private message. 5,7
8 115712 Customer Tue Oct 31 21:45:10 0000 2017 @sprintcare is the worst customer service 9,6,10
11 apple Support Tue Oct 31 22:10:35 0000 2017 @115713 This is saddening to hear. NA
12 115713 Customer Tue Oct 31 22:04:47 0000 2017 @apple My phone is not working. Help! 11
When referencing the first row, we can see the time the first support tweet response came in when responding to tweet_id 1 (this is labeled as 2).
Ideally - I would like to calculate how long it takes for each response tweet to come in - ordering from support to customer.
I would like to calculate two values:
the response time between the first tweet made to Support (tweet_id, response_tweet_id, and created_at). For the first row - the time difference would be: Tue Oct 31 22:10:47 0000 2017 - Tue Oct 31 22:08:27 0000 2017 = 00:02:20.
Total response time between first tweet from support to last tweet given each customer. In the example below - it would essentially be the first tweet made relative to time to support and the last tweet made before next unique id comes into play.
Desired output would look like so based on each group (100 unique companies):
id | group | Avg_response_time_per_tweet (in minutes) | Total_avg_response_time (in minutes)
sprintcare Support #### ####
apple Support #### ####
CodePudding user response:
library(lubridate)
library(dplyr)
library(tidyr)
df %>%
separate(created_at, c("Day_name", "Month", "Day", "Hour", "Minute", "Second", "X", "Year")) %>%
type.convert(as.is = TRUE) %>%
mutate(Month = match(Month, month.abb)) %>%
mutate(created_at = make_datetime(Year, Month, Day, Hour, Minute, Second), .keep="unused") %>%
group_by(id, group) %>%
summarise(Avg_response_time = mean(difftime(max(created_at), min(created_at))))
id group Avg_response_time
<chr> <chr> <drtn>
1 115712 Customer 26.58333 mins
2 sprintcare Support 24.38333 mins
CodePudding user response:
It's not a really performant approach, but it should yield the desired results.
library(dplyr)
library(stringr)
dat %>%
mutate(id_at = str_extract(tweet, "\\d "),
global_id = ifelse(is.na(id_at), id, id_at)) %>%
group_by(global_id) %>%
group_modify(~ .x %>%
rowwise %>%
mutate(response_time = if(group == "Support") {
filter(.x, created_at < .env$created_at, group == "Customer") %>%
slice_max(created_at, n = 1) %>% pull(created_at)
} else NA)
) %>%
summarise(id = first(id), Total_avg_response_time = difftime(max(created_at), min(created_at)),
Avg_response_time_per_tweet = mean(difftime(created_at, response_time), na.rm = TRUE))
#> # A tibble: 1 x 4
#> global_id id Total_avg_response_time Avg_response_time_per_tweet
#> <chr> <chr> <drtn> <drtn>
#> 1 115712 sprintcare 26.58333 mins 2.933333 mins
# data used
dat <- read.table(text = "tweet_id id group created_at tweet
1 sprintcare Support 'Tue Oct 31 22:10:47 0000 2017' '@115712 I understand. I would like to assist you.'
2 115712 Customer 'Tue Oct 31 22:11:45 0000 2017' '@sprintcare and how do you propose we do that'
3 115712 Customer 'Tue Oct 31 22:08:27 0000 2017' '@sprintcare I have sent several private messages and no one is responding as usual'
4 sprintcare Support 'Tue Oct 31 21:54:49 0000 2017' '@115712 Please send us a Private Message so that we can further assist you.'
5 115712 Customer 'Tue Oct 31 21:49:35 0000 2017' '@sprintcare I did.'
6 sprintcare Support 'Tue Oct 31 21:46:24 0000 2017' '@115712 Can you please send us a private message, so that I can gain further details about your account?'
7 115712 Customer 'Tue Oct 31 21:45:10 0000 2017' '@sprintcare is the worst customer service'",
header = TRUE)
dat <- dat %>%
mutate(created_at = strptime(created_at, "%a %b %d %H:%M:%S %z %Y"))
Created on 2021-09-15 by the reprex package (v0.3.0)