I have some time course data of three people having a conversation. Among other stuff like pitch and intensity, I have the timing information and who was speaking at this point - Person L, Person R, Person B or no one ("0"). A short example looks like this where t is the time in seconds and s is the speaker information:
> t = 1:10
> s = c("L", "0", "L", "0", "R", "B", "R", "0", "0", "L")
> data.frame(t,s)
t a
1 1 L
2 2 0
3 3 L
4 4 0
5 5 R
6 6 B
7 7 R
8 8 0
9 9 0
10 10 L
I would like to add information about the speech turns to the data. One turn is one person speaking including pauses until someone else starts speaking. In the specific example above the goal is the following:
t a goal
1 1 L L1
2 2 0 L1
3 3 L L1
4 4 0 L1
5 5 R R1
6 6 B B1
7 7 R R2
8 8 0 R2
9 9 0 R2
10 10 L L2
I know how to do this with a for loop, however, my data has 600000 rows so that would be super slow. Does anyone have an idea how one could accomplish something like this?
CodePudding user response:
A lot of functions, but quite straightforward:
- replace 0 with NAs and
fill
NAs with the most upwards non-NA value. - create a
row_number
row group_by
s and create groups based on consecutiverow_number
s (this is the main function!)paste
s and gp to the desired value.
library(tidyverse)
data.frame(t, s) %>%
mutate(snew = na_if(s, "0"),
rown = row_number()) %>%
fill(snew) %>%
group_by(snew) %>%
mutate(gp = cumsum(c(TRUE, diff(rown) > 1)), .keep = "unused") %>%
ungroup() %>%
mutate(goal = paste0(snew, gp), .keep = "unused")
t s goal
1 1 L L1
2 2 0 L1
3 3 L L1
4 4 0 L1
5 5 R R1
6 6 B B1
7 7 R R2
8 8 0 R2
9 9 0 R2
10 10 L L2
CodePudding user response:
Your key vector of interest:
s <- c("L", "0", "L", "0", "R", "B", "R", "0", "0", "L")
A base R solution:
## fill "0" using vectorized "last observation carried forward"
zero <- which(s == "0")
logi <- c(TRUE, diff(zero) > 1)
s[zero] <- rep(s[zero[logi] - 1], tabulate(cumsum(logi)))
## generate numeric ID
ID <- with(rle(s), rep(ave(values, values, FUN = seq_along), lengths))
## final `paste0`
paste0(s, ID)
#[1] "L1" "L1" "L1" "L1" "R1" "B1" "R2" "R2" "R2" "L2"
CodePudding user response:
Using data.table
:
library(data.table)
chnalocf = \(x) x[nafill(replace(seq_along(x), is.na(x), NA), "locf")]
setDT(df)
df[, s2 := chnalocf(replace(x, x == "0", NA))
][, tmp := rleid(s2)
][, goal := paste0(s2, rleid(tmp)), by = s2
][, !c("s2", "tmp")]
# t s goal
# <int> <char> <char>
# 1: 1 L L1
# 2: 2 0 L1
# 3: 3 L L1
# 4: 4 0 L1
# 5: 5 R R1
# 6: 6 B B1
# 7: 7 R R2
# 8: 8 0 R2
# 9: 9 0 R2
# 10: 10 L L2