R Concatenate Across Rows Within Groups but Preserve Sequence

Time:11-06

My data consists of text from many dyads that has been split into sentences, one per row. I'd like to concatenate the data by speaker within dyads, essentially converting the data to speaking turns. Here's an example data set:

dyad <- c(1,1,1,1,1,2,2,2,2)
speaker <- c("John", "John", "John", "Paul","John", "George", "Ringo", "Ringo", "George")
text <- c("Let's play",
          "We're wasting time",
          "Let's make a record!",
          "Let's work it out first",
          "Why?",
          "It goes like this",
          "Hold on",
          "Have to tighten my snare",
          "Ready?")

dat <- data.frame(dyad, speaker, text)

And this is what I'd like the data to look like:

  dyad speaker                                                text
1      1    John Let's play. We're wasting time. Let's make a record!
2      1    Paul                              Let's work it out first
3      1    John                                                 Why?
4      2  George                                    It goes like this
5      2   Ringo                    Hold on. Have to tighten my snare
6      2  George                                               Ready?

I've tried grouping by speaker and pasting/collapsing with dplyr, but the concatenation combines all of a speaker's text without preserving speaking-turn order. For example, John's last statement ("Why?") winds up with his other text in the output rather than coming after Paul's comment. I also tried checking whether the next speaker (using lead(speaker)) is the same as the current one and then combining, but that only handles pairs of adjacent rows, so it misses John's third comment in the example. It seems like it should be simple, but I can't make it happen. The solution should be flexible enough to combine any run of consecutive rows by a given speaker.

Thanks in advance

CodePudding user response:

Create an additional grouping variable with rleid() (from data.table), which assigns a new id each time the speaker changes, and paste the rows together in summarise():

library(dplyr)
library(data.table)
library(stringr)
dat %>% 
   group_by(dyad, grp = rleid(speaker), speaker) %>% 
   summarise(text = str_c(text, collapse = ' '), .groups = 'drop') %>% 
   select(-grp)

-output

# A tibble: 6 × 3
   dyad speaker text                                              
  <dbl> <chr>   <chr>                                             
1     1 John    Let's play We're wasting time Let's make a record!
2     1 Paul    Let's work it out first                           
3     1 John    Why?                                              
4     2 George  It goes like this                                 
5     2 Ringo   Hold on Have to tighten my snare                  
6     2 George  Ready?                                         
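
If pulling in data.table only for rleid() is undesirable, a similar result can probably be had with dplyr alone. The following is a minimal sketch, assuming dplyr 1.1.0 or later (where consecutive_id() was introduced). The column name turn is made up for illustration, and the ". " separator is an assumption taken from the desired output in the question; it would look odd if sentences already ended in a period.

```r
library(dplyr)  # consecutive_id() needs dplyr >= 1.1.0

# Example data from the question
dat <- data.frame(
  dyad    = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  speaker = c("John", "John", "John", "Paul", "John",
              "George", "Ringo", "Ringo", "George"),
  text    = c("Let's play", "We're wasting time", "Let's make a record!",
              "Let's work it out first", "Why?", "It goes like this",
              "Hold on", "Have to tighten my snare", "Ready?")
)

turns <- dat %>%
  # turn (hypothetical name) gets a new id whenever dyad or speaker changes
  group_by(dyad, turn = consecutive_id(dyad, speaker), speaker) %>%
  # join each turn's sentences with ". " (assumed separator, see above)
  summarise(text = paste(text, collapse = ". "), .groups = "drop") %>%
  select(-turn)

print(turns)
```

Because turn comes before speaker in group_by(), the summarised rows stay in speaking-turn order rather than being sorted alphabetically by speaker.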

CodePudding user response:

Not as elegant as dear akrun's solution. Here helper does the same job as the rleid function, without the need for an additional package:

library(dplyr)
dat %>% 
  # helper increments whenever the speaker differs from the previous row
  mutate(helper = (speaker != lag(speaker, 1, default = "xyz")),
         helper = cumsum(helper)) %>% 
  # helper must come before speaker so the groups keep speaking-turn order
  group_by(dyad, helper, speaker) %>% 
  summarise(text = paste0(text, collapse = " "), .groups = 'drop') %>% 
  select(-helper)

-output

# A tibble: 6 × 3
   dyad speaker text                                              
  <dbl> <chr>   <chr>                                             
1     1 John    Let's play We're wasting time Let's make a record!
2     1 Paul    Let's work it out first                           
3     1 John    Why?                                              
4     2 George  It goes like this                                 
5     2 Ringo   Hold on Have to tighten my snare                  
6     2 George  Ready?                                            
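
For a version with no package dependencies at all, the same change-detection idea can be sketched in base R. This is only an illustration under the assumption that the rows are already in speaking order; the names change, turn and turns are made up for the example.

```r
# Example data from the question
dat <- data.frame(
  dyad    = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  speaker = c("John", "John", "John", "Paul", "John",
              "George", "Ringo", "Ringo", "George"),
  text    = c("Let's play", "We're wasting time", "Let's make a record!",
              "Let's work it out first", "Why?", "It goes like this",
              "Hold on", "Have to tighten my snare", "Ready?")
)

n <- nrow(dat)

# TRUE on the first row and wherever dyad or speaker differs from the row above
change <- c(TRUE,
            dat$dyad[-1] != dat$dyad[-n] | dat$speaker[-1] != dat$speaker[-n])

# Running count of changes gives one id per speaking turn
turn <- cumsum(change)

turns <- data.frame(
  dyad    = dat$dyad[change],     # first row of each turn
  speaker = dat$speaker[change],
  # split() orders the pieces by turn id, so the sequence is preserved
  text    = unname(vapply(split(dat$text, turn), paste, character(1),
                          collapse = " "))
)

print(turns)
```

Comparing dyad as well as speaker guards against the case where one dyad ends and the next begins with the same speaker.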