I've been racking my brain at the following issue.
Let's say I have transcript data from speakers. Every row is a particular utterance from a speaker (given by speaker_id
) and when they started talking, Timestamp
. Sometimes consecutive rows will contain utterances from the same speaker because they continue speaking after a pause.
I'd like to figure out the total amount of time the speakers in the dataset have spoken, which requires that I compute the difference between the start of a timestamp and the beginning of the first timestamp of the next speaker. How would I do this?
Here's a sample of the dataset
file = read.table(text = "speaker_id,Timestamp
5,2022-03-30 03:00:00
5,2022-03-30 03:00:24
3,2022-03-30 03:00:52
3,2022-03-30 03:00:56
3,2022-03-30 03:00:58
5,2022-03-30 03:01:25
5,2022-03-30 03:02:15
3,2022-03-30 03:03:14
5,2022-03-30 03:03:36
3,2022-03-30 03:04:26
3,2022-03-30 03:06:02
3,2022-03-30 03:06:10
5,2022-03-30 03:06:28
5,2022-03-30 03:07:28
3,2022-03-30 03:08:56
5,2022-03-30 03:09:11
5,2022-03-30 03:10:02
5,2022-03-30 03:10:56
3,2022-03-30 03:11:53
5,2022-03-30 03:12:20", header = T, sep = ",")
Any ideas?
CodePudding user response:
Here is an approach using {dplyr}:
- Transform the Timestamp column to as.POSIXct
- Calculate the difference between the current and the next timestamp with
dplyr::lead
- Group by speaker_id
- Sum up duration per speaker. Use na.rm = TRUE since the duration of the last row will be missing/NA
library(dplyr)
file %>%
mutate(
Timestamp = as.POSIXct(Timestamp),
duration = lead(Timestamp) - Timestamp) %>%
group_by(speaker_id) %>%
summarize(total_duration = sum(duration, na.rm = TRUE))
CodePudding user response:
One option could be:
file %>%
mutate(Timestamp2 = lead(Timestamp, default = last(Timestamp))) %>%
group_by(speaker_id,
rleid = cumsum(speaker_id != lag(speaker_id, default = first(speaker_id)))) %>%
mutate(Timestamp_diff = last(ymd_hms(Timestamp2)) - first(ymd_hms(Timestamp)))
speaker_id Timestamp Timestamp2 rleid Timestamp_diff
<int> <chr> <chr> <int> <drtn>
1 5 2022-03-30 03:00:00 2022-03-30 03:00:24 0 52 secs
2 5 2022-03-30 03:00:24 2022-03-30 03:00:52 0 52 secs
3 3 2022-03-30 03:00:52 2022-03-30 03:00:56 1 33 secs
4 3 2022-03-30 03:00:56 2022-03-30 03:00:58 1 33 secs
5 3 2022-03-30 03:00:58 2022-03-30 03:01:25 1 33 secs
6 5 2022-03-30 03:01:25 2022-03-30 03:02:15 2 109 secs
7 5 2022-03-30 03:02:15 2022-03-30 03:03:14 2 109 secs
8 3 2022-03-30 03:03:14 2022-03-30 03:03:36 3 22 secs
9 5 2022-03-30 03:03:36 2022-03-30 03:04:26 4 50 secs
10 3 2022-03-30 03:04:26 2022-03-30 03:06:02 5 122 secs
11 3 2022-03-30 03:06:02 2022-03-30 03:06:10 5 122 secs
12 3 2022-03-30 03:06:10 2022-03-30 03:06:28 5 122 secs
13 5 2022-03-30 03:06:28 2022-03-30 03:07:28 6 148 secs
14 5 2022-03-30 03:07:28 2022-03-30 03:08:56 6 148 secs
15 3 2022-03-30 03:08:56 2022-03-30 03:09:11 7 15 secs
16 5 2022-03-30 03:09:11 2022-03-30 03:10:02 8 162 secs
17 5 2022-03-30 03:10:02 2022-03-30 03:10:56 8 162 secs
18 5 2022-03-30 03:10:56 2022-03-30 03:11:53 8 162 secs
19 3 2022-03-30 03:11:53 2022-03-30 03:12:20 9 27 secs
20 5 2022-03-30 03:12:20 2022-03-30 03:12:20 10 0 secs