In this data of Speakers and their Utterances:
df <- data.frame(
Line = 1:15,
Speaker = c("ID01.A", NA, "ID01.B",
"ID17.A", NA,
"ID27.B", NA, "ID27.B", NA, "ID27.B", "ID27.A",
"ID27.C",
"ID33.B", "ID33.A", "ID33.C"),
Utterance = c("Who did it?", NA, "Peter did.",
"Hello!", "(1.11)",
"Did you", "(1.2)", "erm", "(0.9)", "go [there]?", "[heck] yeah",
"wow!",
"[When] you're coming?", "[that's]", "Yes, sure."),
Sequ = c(1,1,1,
NA, NA,
2,2,2,2,2,2,
NA,
3,3,3),
Q = c("q_wh", "", "",
NA, NA,
"q_pol", "", "", "", "", "",
NA,
"q_wh", "", ""))
I need to count the number of Utterances by each Speaker prior to Speaker change, without the pauses (in round brackets, such as (...)) and the NA values in between each Speaker's contiguous series of Utterances. I'm able to count the number of Utterances by Speaker and Sequ(ence), but the count includes all in-between pauses and NA values:
library(dplyr)
library(tidyr)
df %>%
fill(Speaker, .direction = 'down') %>%
group_by(Speaker, Sequ) %>%
mutate(N_ipu = n())
# A tibble: 15 × 6
# Groups: Speaker, Sequ [9]
Line Speaker Utterance Sequ Q N_ipu
<int> <chr> <chr> <dbl> <chr> <int>
1 1 ID01.A Who did it? 1 "q_wh" 2
2 2 ID01.A NA 1 "" 2
3 3 ID01.B Peter did. 1 "" 1
4 4 ID17.A Hello! NA NA 2
5 5 ID17.A (1.11) NA NA 2
6 6 ID27.B Did you 2 "q_pol" 5
7 7 ID27.B (1.2) 2 "" 5
8 8 ID27.B erm 2 "" 5
9 9 ID27.B (0.9) 2 "" 5
10 10 ID27.B go [there]? 2 "" 5
11 11 ID27.A [heck] yeah 2 "" 1
12 12 ID27.C wow! NA NA 1
13 13 ID33.B [When] you're coming? 3 "q_wh" 1
14 14 ID33.A [that's] 3 "" 1
15 15 ID33.C Yes, sure. 3 "" 1
How can the pauses be excluded so that the final result is this:
# A tibble: 15 × 6
# Groups: Speaker, Sequ [9]
Line Speaker Utterance Sequ Q N_ipu
<int> <chr> <chr> <dbl> <chr> <int>
1 1 ID01.A Who did it? 1 "q_wh" 1
2 2 ID01.A NA 1 "" 1
3 3 ID01.B Peter did. 1 "" 1
4 4 ID17.A Hello! NA NA 1
5 5 ID17.A (1.11) NA NA 1
6 6 ID27.B Did you 2 "q_pol" 3
7 7 ID27.B (1.2) 2 "" 3
8 8 ID27.B erm 2 "" 3
9 9 ID27.B (0.9) 2 "" 3
10 10 ID27.B go [there]? 2 "" 3
11 11 ID27.A [heck] yeah 2 "" 1
12 12 ID27.C wow! NA NA 1
13 13 ID33.B [When] you're coming? 3 "q_wh" 1
14 14 ID33.A [that's] 3 "" 1
15 15 ID33.C Yes, sure. 3 "" 1
CodePudding user response:
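We could flag every string in parentheses in the Utterance column with the regex '\\(([^\\)]+)\\)', set the flag to 0 for those rows and for NA rows and to 1 for all other rows, and then sum the flags within each group.
library(dplyr)
library(tidyr)
library(stringr)  # provides str_detect()
df %>%
  fill(Speaker, .direction = 'down') %>%
  group_by(Speaker, Sequ) %>%
  # helper is 0 for pauses such as "(1.2)" and for NA, 1 otherwise
  mutate(helper = ifelse(str_detect(Utterance, '\\(([^\\)]+)\\)')
                         | is.na(Utterance), 0, 1)) %>%
  # sum the flags per group; .keep = "unused" drops the helper column
  mutate(N_ipu = sum(helper), .keep = "unused")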
Line Speaker Utterance Sequ Q N_ipu
<int> <chr> <chr> <dbl> <chr> <dbl>
1 1 ID01.A Who did it? 1 "q_wh" 1
2 2 ID01.A NA 1 "" 1
3 3 ID01.B Peter did. 1 "" 1
4 4 ID17.A Hello! NA NA 1
5 5 ID17.A (1.11) NA NA 1
6 6 ID27.B Did you 2 "q_pol" 3
7 7 ID27.B (1.2) 2 "" 3
8 8 ID27.B erm 2 "" 3
9 9 ID27.B (0.9) 2 "" 3
10 10 ID27.B go [there]? 2 "" 3
11 11 ID27.A [heck] yeah 2 "" 1
12 12 ID27.C wow! NA NA 1
13 13 ID33.B [When] you're coming? 3 "q_wh" 1
14 14 ID33.A [that's] 3 "" 1
15 15 ID33.C Yes, sure. 3 "" 1
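Grouping by Speaker and Sequ works for the data shown, but it would merge two separate runs if the same Speaker came back later within the same Sequ. If the count should strictly follow "prior to Speaker change", one option is to group by a run id instead. The following is only a sketch on a small made-up data frame (`toy`, `run_id` and `is_ipu` are names chosen here for illustration, not from the question), and it assumes a pause is an Utterance that consists entirely of one parenthesised string:

```r
library(dplyr)
library(tidyr)
library(stringr)

# Made-up sample (not the question's data): speaker B is interrupted by A,
# so B has two separate runs that should be counted separately.
toy <- data.frame(
  Speaker   = c("B", NA, "B", "A", "B", NA),
  Utterance = c("Did you", "(1.2)", "go?", "yeah", "ok", NA)
)

res <- toy %>%
  fill(Speaker, .direction = "down") %>%
  mutate(
    # run_id goes up by 1 every time the speaker differs from the previous row
    run_id = cumsum(Speaker != dplyr::lag(Speaker, default = first(Speaker))) + 1,
    # TRUE for real utterances, FALSE for NA and for pauses such as "(1.2)"
    is_ipu = !is.na(Utterance) & !str_detect(Utterance, "^\\(.*\\)$")
  ) %>%
  group_by(run_id) %>%
  mutate(N_ipu = sum(is_ipu)) %>%
  ungroup()

res
```

Applied to df from the question, the same pipeline should give the counts in the desired output, since no Speaker reappears within a Sequ there. If pauses can also occur in the middle of a longer Utterance, the anchored pattern would need to be adapted.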