Home > OS >  Compute the amount of time between time stamps depending on a group var in dplyr
Compute the amount of time between time stamps depending on a group var in dplyr

Time:05-10

I've been racking my brain at the following issue.

Let's say I have transcript data from speakers. Every row is a particular utterance from a speaker (given by speaker_id) and when they started talking, Timestamp. Sometimes consecutive rows will contain utterances from the same speaker because they continue speaking after a pause.

I'd like to figure out the total amount of time the speakers in the dataset have spoken, which requires that I compute the difference between the start of a timestamp and the beginning of the first timestamp of the next speaker. How would I do this?

Here's a sample of the dataset

file = read.table(text = "speaker_id,Timestamp          
5,2022-03-30 03:00:00
5,2022-03-30 03:00:24
3,2022-03-30 03:00:52
3,2022-03-30 03:00:56
3,2022-03-30 03:00:58
5,2022-03-30 03:01:25
5,2022-03-30 03:02:15
3,2022-03-30 03:03:14
5,2022-03-30 03:03:36
3,2022-03-30 03:04:26
3,2022-03-30 03:06:02
3,2022-03-30 03:06:10
5,2022-03-30 03:06:28
5,2022-03-30 03:07:28
3,2022-03-30 03:08:56
5,2022-03-30 03:09:11
5,2022-03-30 03:10:02
5,2022-03-30 03:10:56
3,2022-03-30 03:11:53
5,2022-03-30 03:12:20", header = T, sep = ",")

Any ideas?

CodePudding user response:

Here is an approach using {dplyr}:

  1. Transform the Timestamp column to as.POSIXct
  2. Calculate the difference between the current and the next timestamp with dplyr::lead
  3. Group by speaker_id
  4. Sum up duration per speaker. Use na.rm = TRUE since the duration of the last row will be missing/NA
library(dplyr)

file %>% 
  mutate(
    Timestamp = as.POSIXct(Timestamp),
    duration = lead(Timestamp) - Timestamp) %>% 
  group_by(speaker_id) %>% 
  summarize(total_duration = sum(duration, na.rm = TRUE))

CodePudding user response:

One option could be:

file %>%
    mutate(Timestamp2 = lead(Timestamp, default = last(Timestamp))) %>%
    group_by(speaker_id,
             rleid = cumsum(speaker_id != lag(speaker_id, default = first(speaker_id)))) %>%
    mutate(Timestamp_diff = last(ymd_hms(Timestamp2)) - first(ymd_hms(Timestamp)))

  speaker_id Timestamp           Timestamp2          rleid Timestamp_diff
        <int> <chr>               <chr>               <int> <drtn>        
 1          5 2022-03-30 03:00:00 2022-03-30 03:00:24     0  52 secs      
 2          5 2022-03-30 03:00:24 2022-03-30 03:00:52     0  52 secs      
 3          3 2022-03-30 03:00:52 2022-03-30 03:00:56     1  33 secs      
 4          3 2022-03-30 03:00:56 2022-03-30 03:00:58     1  33 secs      
 5          3 2022-03-30 03:00:58 2022-03-30 03:01:25     1  33 secs      
 6          5 2022-03-30 03:01:25 2022-03-30 03:02:15     2 109 secs      
 7          5 2022-03-30 03:02:15 2022-03-30 03:03:14     2 109 secs      
 8          3 2022-03-30 03:03:14 2022-03-30 03:03:36     3  22 secs      
 9          5 2022-03-30 03:03:36 2022-03-30 03:04:26     4  50 secs      
10          3 2022-03-30 03:04:26 2022-03-30 03:06:02     5 122 secs      
11          3 2022-03-30 03:06:02 2022-03-30 03:06:10     5 122 secs      
12          3 2022-03-30 03:06:10 2022-03-30 03:06:28     5 122 secs      
13          5 2022-03-30 03:06:28 2022-03-30 03:07:28     6 148 secs      
14          5 2022-03-30 03:07:28 2022-03-30 03:08:56     6 148 secs      
15          3 2022-03-30 03:08:56 2022-03-30 03:09:11     7  15 secs      
16          5 2022-03-30 03:09:11 2022-03-30 03:10:02     8 162 secs      
17          5 2022-03-30 03:10:02 2022-03-30 03:10:56     8 162 secs      
18          5 2022-03-30 03:10:56 2022-03-30 03:11:53     8 162 secs      
19          3 2022-03-30 03:11:53 2022-03-30 03:12:20     9  27 secs      
20          5 2022-03-30 03:12:20 2022-03-30 03:12:20    10   0 secs
  • Related