In my report, each line has an ID, a status and an activity datetime; from that, I need to figure out duration between two relevant statuses for one ID. For example, the data might be as shown below:
ID | status | activitydatetime
123456 | Account Ready | 04-25-2022 06:30
123456 | In Progress | 04-25-2022 06:30
123456 | Additional Documentation Needed | 04-25-2022 06:48
123456 | Completed | 04-25-2022 06:52
234567| In Progress | 04-23-2022 08:25
234567| Query Pending | 04-23-2022 08:30
234567 | In Progress | 04-25-2022 13:00
234567| Completed | 04-25-2022 13:17
From that data, I need to find the duration between In Progress and Completed. I'm using R for this but just can't figure out the best way to do this. The biggest hurdle is writing a script that will run by acct num for each status; i.e., "for 123456, datediff(status = In Progress, status = Completed)" and then go on to the next acct num - but a for loop doesn't feel like the cleanest way to do this.
I've managed to get this working:
library(dplyr)
# duration of all activity per account
activity <- df %>%
group_by(ID) %>%
summarize(
first_activity = min(activitydatetime),
last_activity = max(activitydatetime),
activity = last_activity - first_activity)
The issue is the report doesn't always start with status = In Progress, so I still need to be able to adjust my script to look at status...like, first_activity = the activitydatetime for the first time status = In Progress for ID 123456.
Unfortunately, doing the following gives me the literal first and last activitydatetime for the entire report.
activity <- df %>%
group_by(ID) %>%
summarize(
first_activity = df[first(which(df$status %in% "In Progress")),]$activitydatetime,
last_activity = df[last(which(df$status %in% "Completed")),]$activitydatetime,
activity = last_activity - first_activity)
Thanks for taking a look!
CodePudding user response:
One option would be to first filter
just to the needed rows. I also use slice
to take the earliest instance of In Progress
. I did this because there are 2 instances of In Progress
for ID = 234567
. Then, we can summarize using to get the difference between the two rows. Then, I added a mutate
statement so that the units could be changed.
library(tidyverse)
df %>%
mutate(activity_datetime = strptime(activity_datetime, '%m-%d-%Y %H:%M')) %>%
arrange(ID, status) %>%
group_by(ID, status) %>%
filter(status %in% c("In Progress", "Completed")) %>%
slice(1) %>%
group_by(ID) %>%
summarize(activity = abs(diff(activity_datetime))) %>%
mutate(activity = as.numeric(activity, units = 'mins'))
Output
ID activity
<int> <drtn>
1 123456 22 mins
2 234567 3172 mins
CodePudding user response:
You can join the completed times, then take the difference. After that a grouped filter to get row with maximum difference in time per ID.
library(tidyverse)
# data:
d <- structure(list(ID = c(123456, 123456, 123456, 123456, 234567, 234567, 234567, 234567), status = c("Account Ready", "In Progress", "Additional Documentation Needed", "Completed", "In Progress", "Query Pending", "In Progress", "Completed"), datetime = structure(c(1650868200, 1650868200, 1650869280, 1650869520, 1650702300, 1650702600, 1650891600, 1650892620), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA, -8L), class = c("tbl_df", "tbl", "data.frame"))
d |>
filter(status == "Completed") |>
rename(completed = datetime) |>
select(-status) |>
right_join(d) |>
mutate(diff = completed - datetime) |>
group_by(ID) |>
slice(which.max(diff))
#> Joining, by = "ID"
#> # A tibble: 2 × 5
#> # Groups: ID [2]
#> ID completed status datetime diff
#> <dbl> <dttm> <chr> <dttm> <drtn>
#> 1 123456 2022-04-25 06:52:00 Account Ready 2022-04-25 06:30:00 1320 secs
#> 2 234567 2022-04-25 13:17:00 In Progress 2022-04-23 08:25:00 190320 secs
Created on 2022-04-26 by the reprex package (v2.0.1)