My data consists of text from many dyads that has been split into sentences, one per row. I'd like to concatenate the data by speaker within dyads, essentially converting the data to speaking turns. Here's an example data set:
dyad <- c(1,1,1,1,1,2,2,2,2)
speaker <- c("John", "John", "John", "Paul","John", "George", "Ringo", "Ringo", "George")
text <- c("Let's play",
          "We're wasting time",
          "Let's make a record!",
          "Let's work it out first",
          "Why?",
          "It goes like this",
          "Hold on",
          "Have to tighten my snare",
          "Ready?")
dat <- data.frame(dyad, speaker, text)
And this is what I'd like the data to look like:
  dyad speaker text
1    1 John    Let's play. We're wasting time. Let's make a record!
2    1 Paul    Let's work it out first
3    1 John    Why?
4    2 George  It goes like this
5    2 Ringo   Hold on. Have to tighten my snare
6    2 George  Ready?
I've tried grouping by speaker and pasting/collapsing with dplyr, but that combines all of a speaker's text without preserving the order of speaking turns. For example, John's last statement ("Why?") ends up with his other text in the output rather than coming after Paul's comment. I also tried checking whether the next speaker (using lead(speaker)) is the same as the current one and then combining, but that only handles pairs of adjacent rows, so it misses John's third comment in the example. It seems like it should be simple, but I can't make it work. The solution should be flexible enough to combine any run of consecutive rows by a given speaker.
Thanks in advance
CodePudding user response:
Create another group with rleid (from data.table) and paste the rows in summarise:
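library(dplyr)
library(data.table)  # for rleid()
library(stringr)     # for str_c()

dat %>%
  # rleid() starts a new id whenever the speaker changes, so each turn is its own group
  group_by(dyad, grp = rleid(speaker), speaker) %>%
  # join the sentences of each turn into one string
  summarise(text = str_c(text, collapse = ' '), .groups = 'drop') %>%
  select(-grp)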
-output
# A tibble: 6 × 3
   dyad speaker text
  <dbl> <chr>   <chr>
1     1 John    Let's play We're wasting time Let's make a record!
2     1 Paul    Let's work it out first
3     1 John    Why?
4     2 George  It goes like this
5     2 Ringo   Hold on Have to tighten my snare
6     2 George  Ready?
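To see why grouping on the run id keeps John's two turns apart, it can help to look at the ids on their own. The sketch below rebuilds them with base R's rle(), which for a single vector should give the same numbering as rleid(speaker); the name run_id is only illustrative and not part of the answer above.

```r
speaker <- c("John", "John", "John", "Paul", "John",
             "George", "Ringo", "Ringo", "George")

# rle() finds the runs of identical consecutive values;
# repeating each run's index by its length gives one id per row
runs <- rle(speaker)
run_id <- rep(seq_along(runs$lengths), runs$lengths)

print(data.frame(speaker, run_id))
```

John's first three rows share one id and his later "Why?" row gets a different one, so grouping by the id never merges the two turns.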
CodePudding user response:
Not as elegant as dear akrun's solution. Here helper does the same job as the rleid function, without the need for an additional package:
library(dplyr)

dat %>%
  # helper increases by 1 each time the speaker differs from the previous row
  mutate(helper = (speaker != lag(speaker, 1, default = "xyz")),
         helper = cumsum(helper)) %>%
  # helper goes before speaker so the rows stay in speaking-turn order
  group_by(dyad, helper, speaker) %>%
  summarise(text = paste0(text, collapse = " "), .groups = 'drop') %>%
  select(-helper)

   dyad speaker text
  <dbl> <chr>   <chr>
1     1 John    Let's play We're wasting time Let's make a record!
2     1 Paul    Let's work it out first
3     1 John    Why?
4     2 George  It goes like this
5     2 Ringo   Hold on Have to tighten my snare
6     2 George  Ready?
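For anyone who wants to avoid packages altogether, the same idea can be sketched in base R. This is only one possible way to do it: the names key, turn and turns are made up for the example, and the ". " separator is an assumption taken from the desired output in the question.

```r
dat <- data.frame(
  dyad = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  speaker = c("John", "John", "John", "Paul", "John",
              "George", "Ringo", "Ringo", "George"),
  text = c("Let's play", "We're wasting time", "Let's make a record!",
           "Let's work it out first", "Why?", "It goes like this",
           "Hold on", "Have to tighten my snare", "Ready?")
)

# A new turn starts on the first row and whenever dyad or speaker
# differs from the previous row
key <- paste(dat$dyad, dat$speaker)
turn <- cumsum(c(TRUE, key[-1] != key[-length(key)]))

# Take dyad and speaker from the first row of each turn and
# collapse the sentences of each turn into one string
first_row <- !duplicated(turn)
turns <- data.frame(
  dyad = dat$dyad[first_row],
  speaker = dat$speaker[first_row],
  text = as.vector(tapply(dat$text, turn, paste, collapse = ". "))
)

print(turns)
```

Because turn only ever increases down the rows, the result stays in the original speaking order without any extra sorting.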