Count number of utterances by same speakers but discount number of in-between pauses

Time:01-03

In this data of Speakers and their Utterances:

df <- data.frame(
  Line = 1:15,
  Speaker = c("ID01.A", NA, "ID01.B",                           
              "ID17.A", NA,                                     
              "ID27.B", NA, "ID27.B", NA, "ID27.B", "ID27.A",   
              "ID27.C",                                         
              "ID33.B", "ID33.A", "ID33.C"),                  
  
  Utterance = c("Who did it?", NA, "Peter did.",                                   
                "Hello!", "(1.11)",                                                 
                "Did you", "(1.2)", "erm", "(0.9)", "go [there]?", "[heck] yeah",   
                "wow!",                                                             
                "[When] you're coming?", "[that's]", "Yes, sure."),                 
  Sequ = c(1,1,1,
           NA, NA, 
           2,2,2,2,2,2,
           NA,
           3,3,3),
  Q = c("q_wh", "", "", 
        NA, NA, 
        "q_pol", "", "", "", "", "",
        NA,
        "q_wh", "", ""))

I need to count the number of Utterances by each Speaker prior to a Speaker change, excluding the pauses (in round brackets, such as (1.2)) and the NA values that fall within each Speaker's contiguous series of Utterances. I can count the Utterances by Speaker and Sequ, but the count includes all in-between pauses and NA values:

library(dplyr)
library(tidyr)
df %>%
  fill(Speaker, .direction = 'down') %>%
  group_by(Speaker, Sequ) %>%
  mutate(N_ipu = n())
# A tibble: 15 × 6
# Groups:   Speaker, Sequ [9]
    Line Speaker Utterance              Sequ Q       N_ipu
   <int> <chr>   <chr>                 <dbl> <chr>   <int>
 1     1 ID01.A  Who did it?               1 "q_wh"      2
 2     2 ID01.A  NA                        1 ""          2
 3     3 ID01.B  Peter did.                1 ""          1
 4     4 ID17.A  Hello!                   NA  NA         2
 5     5 ID17.A  (1.11)                   NA  NA         2
 6     6 ID27.B  Did you                   2 "q_pol"     5
 7     7 ID27.B  (1.2)                     2 ""          5
 8     8 ID27.B  erm                       2 ""          5
 9     9 ID27.B  (0.9)                     2 ""          5
10    10 ID27.B  go [there]?               2 ""          5
11    11 ID27.A  [heck] yeah               2 ""          1
12    12 ID27.C  wow!                     NA  NA         1
13    13 ID33.B  [When] you're coming?     3 "q_wh"      1
14    14 ID33.A  [that's]                  3 ""          1
15    15 ID33.C  Yes, sure.                3 ""          1

How can the pauses be excluded so that the final result is this:

# A tibble: 15 × 6
# Groups:   Speaker, Sequ [9]
    Line Speaker Utterance              Sequ Q       N_ipu
   <int> <chr>   <chr>                 <dbl> <chr>   <int>
 1     1 ID01.A  Who did it?               1 "q_wh"      1
 2     2 ID01.A  NA                        1 ""          1
 3     3 ID01.B  Peter did.                1 ""          1
 4     4 ID17.A  Hello!                   NA  NA         1
 5     5 ID17.A  (1.11)                   NA  NA         1
 6     6 ID27.B  Did you                   2 "q_pol"     3
 7     7 ID27.B  (1.2)                     2 ""          3
 8     8 ID27.B  erm                       2 ""          3
 9     9 ID27.B  (0.9)                     2 ""          3
10    10 ID27.B  go [there]?               2 ""          3
11    11 ID27.A  [heck] yeah               2 ""          1
12    12 ID27.C  wow!                     NA  NA         1
13    13 ID33.B  [When] you're coming?     3 "q_wh"      1
14    14 ID33.A  [that's]                  3 ""          1
15    15 ID33.C  Yes, sure.                3 ""          1

CodePudding user response:

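We could flag every string in parentheses in the column Utterance with the regex '\\(([^\\)]+)\\)', score those rows and the NA rows as 0 and all other rows as 1, and then sum per group.

library(dplyr)
library(tidyr)
library(stringr)  # provides str_detect()
df %>%
  fill(Speaker, .direction = 'down') %>%
  group_by(Speaker, Sequ) %>%
  # helper is 0 for pauses such as "(1.2)" and for NA, 1 for real utterances
  mutate(helper = ifelse(str_detect(Utterance, '\\(([^\\)]+)\\)')
                         | is.na(Utterance), 0, 1)) %>% 
  # sum the flags per group; .keep = "unused" drops the helper column
  mutate(N_ipu = sum(helper), .keep="unused")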
    Line Speaker Utterance              Sequ Q       N_ipu
   <int> <chr>   <chr>                 <dbl> <chr>   <dbl>
 1     1 ID01.A  Who did it?               1 "q_wh"      1
 2     2 ID01.A  NA                        1 ""          1
 3     3 ID01.B  Peter did.                1 ""          1
 4     4 ID17.A  Hello!                   NA  NA         1
 5     5 ID17.A  (1.11)                   NA  NA         1
 6     6 ID27.B  Did you                   2 "q_pol"     3
 7     7 ID27.B  (1.2)                     2 ""          3
 8     8 ID27.B  erm                       2 ""          3
 9     9 ID27.B  (0.9)                     2 ""          3
10    10 ID27.B  go [there]?               2 ""          3
11    11 ID27.A  [heck] yeah               2 ""          1
12    12 ID27.C  wow!                     NA  NA         1
13    13 ID33.B  [When] you're coming?     3 "q_wh"      1
14    14 ID33.A  [that's]                  3 ""          1
15    15 ID33.C  Yes, sure.                3 ""          1
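
One caveat: grouping by Speaker and Sequ pools all rows of a speaker within a sequence, even if another speaker talks in between. The question asks for counts "prior to Speaker change", so if the same speaker can take several separate turns in one sequence, a run-based grouping may be closer to the intent. The following is only a sketch under that assumption. The toy data, the turn column, and the stricter pause pattern "^\\([0-9.]+\\)$" (a whole utterance consisting of a bracketed number) are illustrative choices, not part of the original post.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical toy data: speaker "A" takes two separate turns within one Sequ
toy <- data.frame(
  Speaker   = c("A", NA, "A", "B", "A"),
  Utterance = c("Did you", "(1.2)", "go?", "yeah", "ok"),
  Sequ      = c(1, 1, 1, 1, 1)
)

res <- toy %>%
  fill(Speaker, .direction = "down") %>%
  # Start a new turn whenever the filled-in speaker differs from the previous row
  mutate(turn = cumsum(Speaker != lag(Speaker, default = first(Speaker))) + 1) %>%
  group_by(turn) %>%
  # Count only rows that are neither NA nor a pure pause such as "(1.2)"
  mutate(N_ipu = sum(!is.na(Utterance) &
                       !str_detect(Utterance, "^\\([0-9.]+\\)$"))) %>%
  ungroup()

res
```

Here the first turn of "A" gets N_ipu = 2 and the later turn of "A" gets N_ipu = 1, whereas group_by(Speaker, Sequ) would give all of A's rows a count of 3. For the data in the question, no speaker returns after an interruption within the same Sequ, so the per-row counts agree with the answer above.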