Splitting words by whitespace without affecting brackets content using regex

Time:04-04

I'm trying to tokenize sentences using re in Python. Here is an example sentence:

I want a (hot chocolate)[food] and (two)[quantity] boxes of (crispy bacon)[food]

I want to split the sentence on whitespace without breaking up the bracketed groups. For example, I want the resulting list to be:

["I", "want", "a", "(hot chocolate)[food]", "and", "(two)[quantity]", "boxes", "of", "(crispy bacon)[food]"]

How do I write the re.split expression to achieve this?

CodePudding user response:

You can do this with the regex pattern: \s(?!\w+\))

import re
s = """I want a (hot chocolate)[food] and (two)[quantity] boxes of (crispy bacon)[food]"""
print(re.split(r'\s(?!\w+\))', s))
# ['I', 'want', 'a', '(hot chocolate)[food]', 'and', '(two)[quantity]', 'boxes', 'of', '(crispy bacon)[food]']

\s(?!\w+\))
The negative lookahead (?!\w+\)) stops the pattern from matching any whitespace that is followed by a single word and a ), i.e. the space before the last word inside ( ). Note that it only protects that last space, so a group with three or more words, such as (hot dark chocolate), would still be split.

Test regex here: https://regex101.com/r/SRHEXO/1

Test python here: https://ideone.com/reIIcU
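If the parenthesised groups can hold more than two words, a slightly more general lookahead may help. The following is only a sketch, assuming the parentheses are balanced and never nested; the name PATTERN and the sample sentence are made up for illustration. It refuses to split on whitespace when a ) appears ahead before any other parenthesis, which indicates the whitespace sits inside an open group. Spaces inside [ ] are not protected by this pattern.

```python
import re

# Assumption: parentheses are balanced and not nested.
# Split on whitespace unless a ')' is reached before any '(' or ')',
# i.e. unless the whitespace lies inside an open "( ... )" group.
PATTERN = re.compile(r'\s(?![^()]*\))')

s = "I want a (hot dark chocolate)[food] and (two)[quantity] boxes"
tokens = PATTERN.split(s)
print(tokens)
# ['I', 'want', 'a', '(hot dark chocolate)[food]', 'and', '(two)[quantity]', 'boxes']
```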

CodePudding user response:

Regular expressions, no matter how clever, are not always the right answer.

def split(s):
    result = []
    brace_depth = 0  # number of currently open ( or [ brackets
    temp = ''        # token being built
    for ch in s:
        if ch == ' ' and brace_depth == 0:
            # A space outside all brackets ends the current token.
            result.append(temp)
            temp = ''
        elif ch == '(' or ch == '[':
            brace_depth += 1
            temp += ch
        elif ch == ']' or ch == ')':
            brace_depth -= 1
            temp += ch
        else:
            temp += ch
    if temp != '':
        result.append(temp)
    return result
>>> s="I want a (hot chocolate)[food] and (two)[quantity] boxes of (crispy bacon)[food]"
>>> split(s)
['I', 'want', 'a', '(hot chocolate)[food]', 'and', '(two)[quantity]', 'boxes', 'of', '(crispy bacon)[food]']
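The loop above emits empty strings for repeated spaces and silently accepts unbalanced brackets. Below is a hedged variant of the same depth-counting idea that skips empty tokens and raises on unbalanced input. The name split_tokens and the error messages are illustrative, not part of the original answer, and the sketch does not check that each closer matches the type of its opener.

```python
def split_tokens(s):
    """Split on whitespace that lies outside any (...) or [...] group."""
    tokens = []
    current = []  # characters of the token being built
    depth = 0     # number of currently open brackets

    for ch in s:
        if ch in '([':
            depth += 1
        elif ch in ')]':
            depth -= 1
            if depth < 0:
                raise ValueError("closing bracket without a matching opener")

        if ch.isspace() and depth == 0:
            if current:  # skip empty tokens produced by repeated whitespace
                tokens.append(''.join(current))
                current = []
        else:
            current.append(ch)

    if depth != 0:
        raise ValueError("unclosed bracket")
    if current:
        tokens.append(''.join(current))
    return tokens


print(split_tokens("I  want (hot chocolate)[food item]"))
# ['I', 'want', '(hot chocolate)[food item]']
```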

CodePudding user response:

The regex for whitespace is \s. Using this with re.split:

import re
print(re.split(r"\s", "I want a (hot chocolate)[food] and (two)[quantity] boxes of (crispy bacon)[food]"))

The output is ['I', 'want', 'a', '(hot', 'chocolate)[food]', 'and', '(two)[quantity]', 'boxes', 'of', '(crispy', 'bacon)[food]']

Note that this also splits inside the parentheses, so the bracketed groups are not kept intact.
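Because a bare \s split also cuts through the parenthesised groups, one alternative is to match the tokens themselves rather than the separators. This is a minimal sketch, assuming every annotated token has the shape (text)[label] with no nesting; the name TOKEN is hypothetical.

```python
import re

# Assumption: annotated tokens look like "(text)[label]" with no nesting.
# Try the annotated form first; otherwise fall back to a run of non-whitespace.
TOKEN = re.compile(r'\([^)]*\)\[[^\]]*\]|\S+')

s = "I want a (hot chocolate)[food] and (two)[quantity] boxes of (crispy bacon)[food]"
tokens = TOKEN.findall(s)
print(tokens)
# ['I', 'want', 'a', '(hot chocolate)[food]', 'and', '(two)[quantity]', 'boxes', 'of', '(crispy bacon)[food]']
```

With re.findall, any amount of whitespace between tokens is skipped, so repeated spaces do not produce empty strings.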
