I'm trying to tokenize sentences using re in Python, as in the example below:
I want a (hot chocolate)[food] and (two)[quantity] boxes of (crispy bacon)[food]
I want to split on whitespace without breaking up the bracketed groups. For example, I want the resulting list to be:
["I", "want", "a", "(hot chocolate)[food]", "and", "(two)[quantity]", "boxes", "of", "(crispy bacon)[food]"]
How do I write the re.split expression to achieve this?
CodePudding user response:
You can do this with the regex pattern: \s(?!\w+\))
import re
s = """I want a (hot chocolate)[food] and (two)[quantity] boxes of (crispy bacon)[food]"""
print(re.split(r'\s(?!\w+\))', s))
# ['I', 'want', 'a', '(hot chocolate)[food]', 'and', '(two)[quantity]', 'boxes', 'of', '(crispy bacon)[food]']
\s(?!\w+\))
The negative lookahead stops the pattern from matching a space that is immediately followed by a word and a closing ), i.e. the space before the last word inside a (...) group.
Test regex here: https://regex101.com/r/SRHEXO/1
Test python here: https://ideone.com/reIIcU
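The lookahead above only protects the space before the last word in the parentheses, so a group with three or more words would still be split. A minimal sketch of a more general variant, assuming the round brackets are balanced and never nested (the names PAREN_AWARE_SPACE and split_outside_parens are made up for illustration):

```python
import re

# Hypothetical helper names. Assumes "(" and ")" are balanced and not nested.
# A space is skipped when only non-bracket characters separate it from the next ")".
PAREN_AWARE_SPACE = re.compile(r"\s(?![^()]*\))")

def split_outside_parens(text):
    """Split on whitespace that is not inside a (...) group."""
    return PAREN_AWARE_SPACE.split(text)

s = "I want a (very hot chocolate)[food] and (two)[quantity] boxes"
print(split_outside_parens(s))
# ['I', 'want', 'a', '(very hot chocolate)[food]', 'and', '(two)[quantity]', 'boxes']
```

Spaces inside the square-bracket label are not protected by this sketch; it only looks at round brackets.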
CodePudding user response:
Regular expressions, no matter how clever, are not always the right answer.
def split(s):
    result = []
    brace_depth = 0  # > 0 while inside (...) or [...]
    temp = ''
    for ch in s:
        if ch == ' ' and brace_depth == 0:
            # A space outside any bracket ends the current token
            result.append(temp)
            temp = ''
        elif ch == '(' or ch == '[':
            brace_depth += 1
            temp += ch
        elif ch == ']' or ch == ')':
            brace_depth -= 1
            temp += ch
        else:
            temp += ch
    # Flush the last token
    if temp != '':
        result.append(temp)
    return result
>>> s="I want a (hot chocolate)[food] and (two)[quantity] boxes of (crispy bacon)[food]"
>>> split(s)
['I', 'want', 'a', '(hot chocolate)[food]', 'and', '(two)[quantity]', 'boxes', 'of', '(crispy bacon)[food]']
CodePudding user response:
The regex for whitespace is \s. So, using this with re.split:
print(re.split(r"\s", "I want a (hot chocolate)[food] and (two)[quantity] boxes of (crispy bacon)[food]"))
The output is ['I', 'want', 'a', '(hot', 'chocolate)[food]', 'and', '(two)[quantity]', 'boxes', 'of', '(crispy', 'bacon)[food]']
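Because splitting on \s alone also breaks "(hot chocolate)[food]" into two pieces, one alternative is to match whole tokens with re.findall instead of splitting. This is only a sketch, assuming every tagged token has the shape (text)[label] with no nested brackets; TOKEN and tokenize are hypothetical names:

```python
import re

# Assumed token shapes: "(text)[label]" with no nested brackets,
# otherwise any run of non-whitespace characters.
TOKEN = re.compile(r"\([^()]*\)\[[^\[\]]*\]|\S+")

def tokenize(text):
    """Return the tokens of text, keeping (text)[label] groups whole."""
    return TOKEN.findall(text)

s = "I want a (hot chocolate)[food] and (two)[quantity] boxes of (crispy bacon)[food]"
print(tokenize(s))
# ['I', 'want', 'a', '(hot chocolate)[food]', 'and', '(two)[quantity]', 'boxes', 'of', '(crispy bacon)[food]']
```

The bracketed alternative is listed first so that it is tried before the generic \S+ at each position.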