I have utterances with annotation symbols:
utt <- c("↑hey girls↑ can I <join yo:u>", "((v: grunts))", "!damn shit! got it",
"I mean /yeah we saw each other at a party:/↓ the other day"
)
I need to split utt into separate words unless the words are enclosed by certain delimiters, including this class: [(/≈↑£<>°!]. I'm doing reasonably well using double negative lookahead for utts where only one such string between delimiters occurs, but I'm failing to split correctly where there are multiple such strings between delimiters:
library(tidyr)
library(dplyr)
data.frame(utt2 = utt) %>%
separate_rows(utt2, sep = "(?!.*[(/≈↑£<>°!].*)\\s(?!.*[)/≈↑£<>°!])")
# A tibble: 9 × 1
utt2
<chr>
1 ↑hey girls↑ can I <join yo:u>
2 ((v: grunts))
3 !damn shit!
4 got
5 it
6 I mean /yeah we saw each other at a party:/↓
7 the
8 other
9 day
The expected result would be:
1 ↑hey girls↑
2 can
3 I
4 <join yo:u>
5 ((v: grunts))
6 !damn shit!
7 got
8 it
9 I
10 mean
11 /yeah we saw each other at a party:/↓
12 the
13 other
14 day
CodePudding user response:
You can use
data.frame(utt2 = utt) %>% separate_rows(utt2, sep = "(?:([/≈↓£°!↑]).*?\\1|\\([^()]*\\)|<[^<>]*>)(*SKIP)(*F)|\\s+")
Note that in your case, there are chars that are paired (like ( and ), < and >) and non-paired chars (like ↑, £). They require different handling, which is reflected in the pattern.
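To make the paired versus non-paired distinction concrete, here is a minimal base R sketch that only extracts the protected spans. The names self_paired, bracketed, hits_self and hits_brk are illustrative, not from the original post, and it assumes a UTF-8 session so the annotation symbols are read as single characters.

```r
# Non-paired delimiters: the same char opens and closes the span,
# so the closing char is found with a backreference to Group 1.
self_paired <- "([/≈↓£°!↑]).*?\\1"

# Paired delimiters: opening and closing chars differ, so each pair
# gets its own alternative with a negated character class in between.
bracketed <- "\\([^()]*\\)|<[^<>]*>"

x <- "↑hey girls↑ can I <join yo:u>"

# perl = TRUE selects the PCRE engine used by the full pattern
hits_self <- regmatches(x, gregexpr(self_paired, x, perl = TRUE))[[1]]
hits_brk  <- regmatches(x, gregexpr(bracketed, x, perl = TRUE))[[1]]

print(hits_self)  # the ↑...↑ span
print(hits_brk)   # the <...> span
```

These two sub-patterns are the alternatives that the full pattern protects from splitting.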
Details:
(?:([/≈↓£°!↑]).*?\\1|\\([^()]*\\)|<[^<>]*>)(*SKIP)(*F) matches
    ([/≈↓£°!↑]).*?\1| - a /, ≈, ↓, £, °, ! or ↑ char captured into Group 1, then any zero or more chars other than line break chars, as few as possible (see .*?), and then the same char as captured into Group 1, or
    \([^()]*\)| - (, zero or more chars other than ( and ), and then a ) char, or
    <[^<>]*> - <, zero or more chars other than < and >, and then a > char
    (*SKIP)(*F) - skip the matched text and restart a new search from the failure position
| - or
\s+ - one or more whitespaces in any other context.
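As a quick check without tidyr, the same pattern can be tried with base R's strsplit, which uses PCRE when perl = TRUE. This is only a sketch: the names pat and tokens are illustrative, the vector is named utt2 here to match the column name in the output, and a UTF-8 session is assumed.

```r
utt2 <- c("↑hey girls↑ can I <join yo:u>", "((v: grunts))", "!damn shit! got it",
          "I mean /yeah we saw each other at a party:/↓ the other day")

# Delimited spans are matched first and discarded with (*SKIP)(*F);
# only whitespace outside those spans is left as a split point.
pat <- "(?:([/≈↓£°!↑]).*?\\1|\\([^()]*\\)|<[^<>]*>)(*SKIP)(*F)|\\s+"

# perl = TRUE is required: (*SKIP) and (*F) are PCRE verbs
tokens <- unlist(strsplit(utt2, pat, perl = TRUE))
print(tokens)
```

With the sample utterances this should give the 14 items listed in the expected result, with each delimited span kept whole.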