Separate strings into rows unless between sets of delimiters

Time:12-02

I have utterances with annotation symbols:

utt <- c("↑hey girls↑ can I <join yo:u>", "((v: grunts))", "!damn shit! got it", 
"I mean /yeah we saw each other at a party:/↓ the other day"
)

I need to split utt into separate words, except where the words are enclosed by certain delimiters, including this class: [(/≈↑£<>°!]. Using two negative lookaheads works reasonably well for utterances that contain only one such delimited string, but the split fails where an utterance contains several delimited strings:

library(tidyr)
library(dplyr)
data.frame(utt2 = utt) %>%
  separate_rows(utt2, sep = "(?!.*[(/≈↑£<>°!].*)\\s(?!.*[)/≈↑£<>°!])")
# A tibble: 9 × 1
  utt2                                        
  <chr>                                       
1 ↑hey girls↑ can I <join yo:u>               
2 ((v: grunts))                               
3 !damn shit!                                 
4 got                                         
5 it                                          
6 I mean /yeah we saw each other at a party:/↓
7 the
8 other                                       
9 day 

The expected result would be:

1 ↑hey girls↑ 
2 can
3 I
4 <join yo:u>               
5 ((v: grunts))                               
6 !damn shit!                                 
7 got                                         
8 it                                          
9 I
10 mean 
11 /yeah we saw each other at a party:/↓
12 the
13 other                                       
14 day 

CodePudding user response:

You can use

data.frame(utt2 = utt) %>% separate_rows(utt2, sep = "(?:([/≈↓£°!↑]).*?\\1|\\([^()]*\\)|<[^<>]*>)(*SKIP)(*F)|\\s+")

See the regex demo.

Note that in your case, some delimiters are paired (like ( and ), < and >) and some are non-paired, with the same char opening and closing the span (like ↑, £). They require different handling, which is reflected in the pattern.
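The difference between the two kinds of delimiter can be seen on a small sample. This is only a sketch in base R: the sample string and the shortened character class [/!] are made up for illustration and are not part of the pattern above.

```r
# Hypothetical sample: one non-paired span (!...!) and one paired span (<...>).
x <- "!damn shit! got <join yo:u> now"

# Non-paired delimiter: capture the opening char and require the same char
# again with a backreference (\\1 in an R string is \1 in the regex).
same_symbol <- regmatches(x, gregexpr("([/!]).*?\\1", x, perl = TRUE))[[1]]

# Paired delimiter: opening and closing chars differ, so both are spelled out.
paired <- regmatches(x, gregexpr("<[^<>]*>", x, perl = TRUE))[[1]]

print(same_symbol)
print(paired)
```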

Details:

  • (?:([/≈↓£°!↑]).*?\\1|\\([^()]*\\)|<[^<>]*>)(*SKIP)(*F) matches
    • ([/≈↓£°!↑]).*?\1| - a /, ≈, ↓, £, °, ! or ↑ char captured into Group 1, then any zero or more chars other than line break chars, as few as possible (see .*?), and then the same char as captured into Group 1, or
    • \([^()]*\)| - (, zero or more chars other than ( and ) and then a ) char, or
    • <[^<>]*> - <, zero or more chars other than < and > and then a > char
    • (*SKIP)(*F) - skip the matched text and restart a new search from the failure position
  • | - or
  • \s+ - one or more whitespaces in any other context.
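To check the pattern without tidyr, the same split can be sketched in base R with gregexpr(perl = TRUE) and regmatches(invert = TRUE). The helper name split_outside is hypothetical, and the sketch assumes a UTF-8 session so that the non-ASCII delimiters are treated as single characters.

```r
# Sample utterances from the question.
utt <- c("↑hey girls↑ can I <join yo:u>", "((v: grunts))", "!damn shit! got it",
         "I mean /yeah we saw each other at a party:/↓ the other day")

# Delimited spans are matched first and discarded with (*SKIP)(*F);
# whitespace anywhere else is the actual split point.
pattern <- "(?:([/≈↓£°!↑]).*?\\1|\\([^()]*\\)|<[^<>]*>)(*SKIP)(*F)|\\s+"

# Hypothetical helper: return the pieces of each string that lie between
# matches of `pattern`, flattened into one character vector.
split_outside <- function(x, pattern) {
  hits <- gregexpr(pattern, x, perl = TRUE)    # PCRE is needed for (*SKIP)(*F)
  unlist(regmatches(x, hits, invert = TRUE))   # invert = TRUE keeps the non-matched parts
}

rows <- split_outside(utt, pattern)
print(rows)
```

With the sample data this should give the 14 pieces listed in the expected result. The backtracking verbs are PCRE-specific, so the pattern is not expected to work with regex engines that lack them.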