In R, is there a way to get the difference between two rows based on a common value and criteria app-CodePudding

In my report, each line has an ID, a status and an activity datetime; from that, I need to figure out duration between two relevant statuses for one ID. For example, the data might be as shown below:

ID | status | activitydatetime
123456 | Account Ready | 04-25-2022 06:30
123456 | In Progress | 04-25-2022 06:30
123456 | Additional Documentation Needed | 04-25-2022 06:48
123456 | Completed | 04-25-2022 06:52
234567| In Progress | 04-23-2022 08:25
234567| Query Pending | 04-23-2022 08:30
234567 | In Progress | 04-25-2022 13:00
234567| Completed | 04-25-2022 13:17

From that data, I need to find the duration between In Progress and Completed. I'm using R for this but just can't figure out the best way to do this. The biggest hurdle is writing a script that will run by acct num for each status; i.e., "for 123456, datediff(status = In Progress, status = Completed)" and then go on to the next acct num - but a for loop doesn't feel like the cleanest way to do this.

I've managed to get this working:

library(dplyr)
# duration of all activity per account
activity <- df %>%
group_by(ID) %>%
summarize(
    first_activity = min(activitydatetime),
    last_activity = max(activitydatetime),
    activity = last_activity - first_activity)

The issue is the report doesn't always start with status = In Progress, so I still need to be able to adjust my script to look at status...like, first_activity = the activitydatetime for the first time status = In Progress for ID 123456.

Unfortunately, doing the following gives me the literal first and last activitydatetime for the entire report.

activity <- df %>%
group_by(ID) %>%
summarize(
    first_activity = df[first(which(df$status %in% "In Progress")),]$activitydatetime,
    last_activity = df[last(which(df$status %in% "Completed")),]$activitydatetime,
    activity = last_activity - first_activity)

Thanks for taking a look!

CodePudding user response：

One option would be to first filter just to the needed rows. I also use slice to take the earliest instance of In Progress. I did this because there are 2 instances of In Progress for ID = 234567. Then, we can summarize using to get the difference between the two rows. Then, I added a mutate statement so that the units could be changed.

library(tidyverse)

df %>%
  mutate(activity_datetime = strptime(activity_datetime, '%m-%d-%Y %H:%M')) %>%
  arrange(ID, status) %>%
  group_by(ID, status) %>%
  filter(status %in% c("In Progress", "Completed")) %>%
  slice(1) %>%
  group_by(ID) %>%
  summarize(activity = abs(diff(activity_datetime))) %>%
  mutate(activity = as.numeric(activity, units = 'mins'))

Output

      ID activity 
   <int> <drtn>   
1 123456   22 mins
2 234567 3172 mins

CodePudding user response：

You can join the completed times, then take the difference. After that a grouped filter to get row with maximum difference in time per ID.

library(tidyverse)

# data:
d <- structure(list(ID = c(123456, 123456, 123456, 123456, 234567, 234567, 234567, 234567), status = c("Account Ready", "In Progress", "Additional Documentation Needed", "Completed", "In Progress", "Query Pending", "In Progress", "Completed"), datetime = structure(c(1650868200, 1650868200, 1650869280, 1650869520, 1650702300, 1650702600, 1650891600, 1650892620), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA, -8L), class = c("tbl_df", "tbl", "data.frame"))

d |> 
  filter(status == "Completed") |> 
  rename(completed = datetime) |> 
  select(-status) |> 
  right_join(d) |> 
  mutate(diff = completed - datetime) |> 
  group_by(ID) |> 
  slice(which.max(diff))
#> Joining, by = "ID"
#> # A tibble: 2 × 5
#> # Groups:   ID [2]
#>       ID completed           status        datetime            diff       
#>    <dbl> <dttm>              <chr>         <dttm>              <drtn>     
#> 1 123456 2022-04-25 06:52:00 Account Ready 2022-04-25 06:30:00   1320 secs
#> 2 234567 2022-04-25 13:17:00 In Progress   2022-04-23 08:25:00 190320 secs

^{Created on 2022-04-26 by the reprex package (v2.0.1)}