I have transcriptions of interviews that are partly irregularly formed:
tst <- c("In: ja COOL; #00:04:24-6# ",
" in den vier, FÜNF wochen, #00:04:57-8# ",
"In: jah, #00:02:07-8# ",
"In: [ja; ] #00:03:25-5# [ja; ] #00:03:26-1#",
" also jA:h; #00:03:16-6# (1.1)",
"Bz: [E::hm; ] #00:03:51-4# (3.0) ",
"Bz: [mhmh, ]",
" in den bilLIE da war;")
What I need to do is structure this data by extracting its key elements into columns of a dataframe. There are four such key elements:
Role in interview: interviewee or interviewer
Utterance: the interview partners' speech
Timestamp: indicated by # at both ends
Gap: indicated by a decimal number in brackets
The problem is that both Timestamp and Gap are inconsistently provided. While I can make the last capture group for Gap optional, those strings that have neither Timestamp nor Gap are not rendered correctly.
I'm using extract from tidyr for the extraction:
library(tidyr)
data.frame(tst) %>%
  extract(col = tst,
          into = c("Role", "Utterance", "Timestamp", "Gap"),
          regex = "^(\\w{2}:\\s|\\s+)([\\S\\s]+?)\\s*#([^#]+)?#\\s*(\\([0-9.]+\\))?\\s*")
  Role Utterance                  Timestamp  Gap
1 In:  ja COOL;                   00:04:24-6
2      in den vier, FÜNF wochen,  00:04:57-8
3 In:  jah,                       00:02:07-8
4 In:  [ja; ]                     00:03:25-5
5      also jA:h;                 00:03:16-6 (1.1)
6 Bz:  [E::hm; ]                  00:03:51-4 (3.0)
7 <NA> <NA>                       <NA>       <NA>
8 <NA> <NA>                       <NA>       <NA>
How can the regex be refined so that I get this desired output:
  Role Utterance                  Timestamp  Gap
1 In:  ja COOL;                   00:04:24-6
2      in den vier, FÜNF wochen,  00:04:57-8
3 In:  jah,                       00:02:07-8
4 In:  [ja; ]                     00:03:25-5
5      also jA:h;                 00:03:16-6 (1.1)
6 Bz:  [E::hm; ]                  00:03:51-4 (3.0)
7 Bz:  [mhmh, ]
8      in den bilLIE da war;
Answer:
An alternative to a single complex regex is to use several extract calls with simpler regexes. Afterwards, convert any NAs to "" and strip unwanted whitespace.
library(dplyr)
library(tidyr)
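data.frame(tst) %>%
  # pull out each optional element separately; non-matching rows become NA
  extract(tst, "Gap", "(\\(.*?\\))", remove = FALSE) %>%
  extract(tst, "Timestamp", "(#.*?#)", remove = FALSE) %>%
  extract(tst, c("Role", "Utterance"), "^(\\S+:|)([^#]*)") %>%
  # replace NA with "" and trim surrounding whitespace
  mutate(across(everything(), ~ coalesce(.x, "")), Utterance = trimws(Utterance))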
giving:
  Role Utterance                  Timestamp    Gap
1 In:  ja COOL;                   #00:04:24-6#
2      in den vier, FÜNF wochen,  #00:04:57-8#
3 In:  jah,                       #00:02:07-8#
4 In:  [ja; ]                     #00:03:25-5#
5      also jA:h;                 #00:03:16-6# (1.1)
6 Bz:  [E::hm; ]                  #00:03:51-4# (3.0)
7 Bz:  [mhmh, ]
8      in den bilLIE da war;
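Note that the Timestamp values above still carry their # delimiters, whereas the desired output in the question does not. Below is a minimal base R sketch of one way to drop them afterwards. The vector names timestamps and cleaned are illustrative and not part of the original answer:

```r
# Illustrative values shaped like the Timestamp column above (hypothetical vector).
timestamps <- c("#00:04:24-6#", "", "#00:03:51-4#")

# fixed = TRUE treats "#" as a literal character rather than as a regex.
cleaned <- gsub("#", "", timestamps, fixed = TRUE)

# cleaned is now c("00:04:24-6", "", "00:03:51-4")
print(cleaned)
```

Inside the pipeline, the same call could presumably be applied as mutate(Timestamp = gsub("#", "", Timestamp, fixed = TRUE)). Alternatively, the capture group in the Timestamp extract call could be moved inside the delimiters, as in "#(.*?)#".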
Answer:
You could keep your 4 capture groups and wrap the trailing part in an optional non-capturing group: the 3rd group (Timestamp) is matched optionally, followed optionally by the 4th group (Gap), and the end of the string is asserted with $:
library(tidyr)
tst <- c("In: ja COOL; #00:04:24-6# ",
" in den vier, FÜNF wochen, #00:04:57-8# ",
"In: jah, #00:02:07-8# ",
"In: [ja; ] #00:03:25-5# [ja; ] #00:03:26-1#",
" also jA:h; #00:03:16-6# (1.1)",
"Bz: [E::hm; ] #00:03:51-4# (3.0) ",
"Bz: [mhmh, ]",
" in den bilLIE da war;")
data.frame(tst) %>%
  extract(col = tst,
          into = c("Role", "Utterance", "Timestamp", "Gap"),
          regex = "^(\\w{2}:\\s|\\s+)([\\s\\S]*?)(?:\\s*#([^#]+)(?:#\\s*(\\([0-9.]+\\))?\\s*)?)?$")
Output
  Role Utterance                   Timestamp  Gap
1 In:  ja COOL;                    00:04:24-6
2      in den vier, FÜNF wochen,   00:04:57-8
3 In:  jah,                        00:02:07-8
4 In:  [ja; ] #00:03:25-5# [ja; ]  00:03:26-1
5      also jA:h;                  00:03:16-6 (1.1)
6 Bz:  [E::hm; ]                   00:03:51-4 (3.0)
7 Bz:  [mhmh, ]
8      in den bilLIE da war;
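To see why the question's pattern produced NA rows while this one does not, it can help to test both patterns against lines with and without a timestamp. The base R sketch below uses grepl with perl = TRUE as a stand-in for the matching done inside extract. That is an assumption made for illustration, since extract itself is not called here, and the variable names are hypothetical:

```r
# A few of the input lines: two with a timestamp, two without.
sample_lines <- c("In: jah, #00:02:07-8# ",
                  " also jA:h; #00:03:16-6# (1.1)",
                  "Bz: [mhmh, ]",
                  " in den bilLIE da war;")

# Pattern from the question: the "#...#" part is mandatory.
original <- "^(\\w{2}:\\s|\\s+)([\\S\\s]+?)\\s*#([^#]+)?#\\s*(\\([0-9.]+\\))?\\s*"

# Pattern from this answer: timestamp and gap sit inside an optional group.
refined <- "^(\\w{2}:\\s|\\s+)([\\s\\S]*?)(?:\\s*#([^#]+)(?:#\\s*(\\([0-9.]+\\))?\\s*)?)?$"

matches_original <- grepl(original, sample_lines, perl = TRUE)
matches_refined  <- grepl(refined, sample_lines, perl = TRUE)

print(matches_original)  # TRUE TRUE FALSE FALSE
print(matches_refined)   # TRUE TRUE TRUE TRUE
```

When the whole pattern fails to match a row, extract fills every new column with NA, which is what happened to rows 7 and 8 in the question. Making the timestamp part optional lets those rows match, with the unused groups left empty.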