I have data from an experiment where Subject
s rated an Event
called f
:
df <- structure(list(Subject = c("A", "A", "A", "B", "B", "B"),
Timestamp = c("00:00:00.146 - 00:00:00.889",
"00:00:01.568 - 00:00:02.183", "00:00:03.642 - 00:00:04.522",
"00:00:00.000 - 00:00:00.660", "00:00:01.247 - 00:00:02.229",
"00:00:03.697 - 00:00:04.926"),
Starttime_ms = c(146, 1568, 3642, 0, 1247, 3697),
Endtime_ms = c(889, 2183, 4522, 660, 2229, 4926),
Duration = c(743, 615, 880, 660, 982, 1229),
Event = c("f", "f", "f", "f", "f", "f")), row.names = c(NA, 6L), class = "data.frame")
What is missing are the respective measurements for those in-between timespans where the Subject
s did not rate the event. That's an event too, just a negative one; let's call it nf
. How can I insert rows for those missing timespans to obtain the complete record including both positive and negative events f
and nf
shown below?
Expected:
Subject Timestamp Starttime_ms Endtime_ms Duration Event
1 A 00:00:00.000 - 00:00:00.146 0 146 146 nf
2 A 00:00:00.146 - 00:00:00.889 146 889 743 f
3 A 00:00:00.889 - 00:00:01.568 889 1568 679 nf
4 A 00:00:01.568 - 00:00:02.183 1568 2183 615 f
5 A 00:00:02.183 - 00:00:03.642 2183 3642 1459 nf
6 A 00:00:03.642 - 00:00:04.522 3642 4522 880 f
7 B 00:00:00.000 - 00:00:00.660 0 660 660 f
8 B 00:00:00.660 - 00:00:01.247 660 1247 587 nf
9 B 00:00:01.247 - 00:00:02.229 1247 2229 982 f
10 B 00:00:02.229 - 00:00:03.697 2229 3697 1468 nf
11 B 00:00:03.697 - 00:00:04.926 3697 4926 1229 f
I've experimented with data.table
but since I'm not fluent with this, this attempt only brought partial success:
unq <- c(0, sort(unique(setDT(df)[, c(Starttime_ms, Endtime_ms)])))
df[.(unq[-length(unq)], unq[-1]), on=c("Starttime_ms", "Endtime_ms")]
While I'm open to any solution my preference is for a dplyr
one.
CodePudding user response:
This is a tidyverse
solution.
library(tidyverse)
library(lubridate)
df %>%
separate(Timestamp, c("start", "end"), sep = " - ") %>%
group_by(Subject) %>%
mutate(Starttime_ms = lag(end, default = "00:00:00.000"),
Endtime_ms = start,
Event = "nf",
across(ends_with("_ms"), ~ hms(.x) * 1000),
Duration = Endtime_ms - Starttime_ms,
across(where(is.period), as.numeric)) %>%
unite(Timestamp, start, end, sep = " - ") %>%
bind_rows(df) %>%
arrange(Starttime_ms, .by_group = TRUE) %>%
filter(Duration > 0) %>%
ungroup()
Output
# A tibble: 11 × 6
Subject Timestamp Starttime_ms Endtime_ms Duration Event
<chr> <chr> <dbl> <dbl> <dbl> <chr>
1 A 00:00:00.146 - 00:00:00.889 0 146 146 nf
2 A 00:00:00.146 - 00:00:00.889 146 889 743 f
3 A 00:00:01.568 - 00:00:02.183 889 1568 679 nf
4 A 00:00:01.568 - 00:00:02.183 1568 2183 615 f
5 A 00:00:03.642 - 00:00:04.522 2183 3642 1459 nf
6 A 00:00:03.642 - 00:00:04.522 3642 4522 880 f
7 B 00:00:00.000 - 00:00:00.660 0 660 660 f
8 B 00:00:01.247 - 00:00:02.229 660 1247 587 nf
9 B 00:00:01.247 - 00:00:02.229 1247 2229 982 f
10 B 00:00:03.697 - 00:00:04.926 2229 3697 1468 nf
11 B 00:00:03.697 - 00:00:04.926 3697 4926 1229 f
CodePudding user response:
You were nearly there. The issue is to complete the time sequence separately for each group. This is what would do:
library(data.table)
# create sequence of complete periods for each Subject
unq <- setDT(df)[, {
tmp <- sort(unique(c(0, Starttime_ms, Endtime_ms)))
list(Starttime_ms = head(tmp, -1L),
Endtime_ms = tail(tmp, -1L))
}, by = Subject]
# join and complete missing values (except Timestamp)
df[unq, on = .(Subject, Starttime_ms, Endtime_ms)][
is.na(Event), c("Duration", "Event") := .(Endtime_ms - Starttime_ms, "nf")][]
Subject Timestamp Starttime_ms Endtime_ms Duration Event 1: A <NA> 0 146 146 nf 2: A 00:00:00.146 - 00:00:00.889 146 889 743 f 3: A <NA> 889 1568 679 nf 4: A 00:00:01.568 - 00:00:02.183 1568 2183 615 f 5: A <NA> 2183 3642 1459 nf 6: A 00:00:03.642 - 00:00:04.522 3642 4522 880 f 7: B 00:00:00.000 - 00:00:00.660 0 660 660 f 8: B <NA> 660 1247 587 nf 9: B 00:00:01.247 - 00:00:02.229 1247 2229 982 f 10: B <NA> 2229 3697 1468 nf 11: B 00:00:03.697 - 00:00:04.926 3697 4926 1229 f
I am still unsure what the best way is to complete Timestamp
.
unq
contains all periods for each subject to join on:
unq
Subject Starttime_ms Endtime_ms 1: A 0 146 2: A 146 889 3: A 889 1568 4: A 1568 2183 5: A 2183 3642 6: A 3642 4522 7: B 0 660 8: B 660 1247 9: B 1247 2229 10: B 2229 3697 11: B 3697 4926