Capture a sequence of words separated by whitespace through an existing regex

Following up from an earlier version of this question asked here.

I have a string of the form --

test = "<bos> <start> some fruits: <mid> apple, oranges <mid> also pineapple <start> some animals: <mid> dogs, cats <eos>"

which needs to be converted to a dictionary (<str>:[List]) of the form:

{"some fruits:": ["apple, oranges", "also pineapple"], "some animals:": ["dogs, cats"]}

The text after a <start> tag is a key. Each <mid> tag that follows introduces one string in that key's list, and everything up to the next tag belongs to that single string.

Currently, my regex (from the post linked above) looks like this

res = re.finditer(r'<start>\s(\w+)\s<mid>\s(\w+(?:\s<mid>\s\w+)*)', test)

which can then be iterated over to create a dictionary --

test_dict = {}
for match in res:
    test_dict[match.group(1)] = match.group(2).split(' <mid> ')

However, I am unable to capture multiple words between the <start> and <mid> tags (i.e. text containing whitespace, commas, etc.).

How can this regex be formatted to capture everything between multiple <> tags?

CodePudding user response：

You could use re.findall:

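import re

data = {}
# Each match is a (tag, text) pair; the text runs up to the next '<'.
for m in re.findall(r'(<\w+>)\s+([^<]+)', test):
    if m[0] == '<start>':
        # A <start> tag opens a new key with an empty list.
        l = data.setdefault(m[1].strip(), [])
    elif m[0] == '<mid>':
        # A <mid> tag appends to the most recent key's list.
        l.append(m[1].strip())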

Output:

>>> data
{'some fruits:': ['apple, oranges', 'also pineapple'],
 'some animals:': ['dogs, cats']}
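
To see why the loop works, it may help to look at what re.findall returns before any grouping happens. The following is a minimal sketch, assuming the same test string as in the question; the name pairs is introduced here only for illustration.

```python
import re

test = "<bos> <start> some fruits: <mid> apple, oranges <mid> also pineapple <start> some animals: <mid> dogs, cats <eos>"

# With two capture groups, re.findall returns a list of (tag, text) tuples.
# [^<]+ needs at least one non-'<' character, so <bos> (followed directly by
# another tag) and <eos> (followed by nothing) produce no match at all.
pairs = [(tag, text.strip()) for tag, text in re.findall(r'(<\w+>)\s+([^<]+)', test)]

for tag, text in pairs:
    print(tag, repr(text))
```

Only <start> and <mid> pairs survive, in document order, which is why the loop can treat each <start> as "open a new key" and each <mid> as "append to the current key".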

CodePudding user response：

For a non-regex approach, you can strip off the <bos> and <eos> tags, since you know you want a dict, and then use str.split to split on the remaining tags. Note that I am only using this approach because there are no end tags, like </mid>, to keep track of.

from collections import defaultdict

test = "<bos> <start> some fruits: <mid> apple, oranges <mid> also pineapple <start> some animals: <mid> dogs, cats <eos>"

# Drop the leading <bos> and trailing <eos> (5 characters each).
test = test[5:-5].strip()

result = defaultdict(list)

for item in test.split('<start>'):
    if not item:
        continue

    # The first piece is the key; every later piece is one list entry.
    key, *vals = item.strip().split('<mid>')
    result[key.strip()].extend(val.strip() for val in vals)

print(result)

Result:

defaultdict(<class 'list'>, {'some fruits:': ['apple, oranges', 'also pineapple'], 'some animals:': ['dogs, cats']})

A regex approach would probably work too, but I haven't tested one yet.
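
For reference, a regex version that stays close to the finditer pattern in the question might look like the sketch below. It assumes the text between tags never contains a literal '<', and it swaps \w+ for [^<]+ so that a field can hold spaces, commas and colons. The name pattern is introduced here for illustration.

```python
import re

test = "<bos> <start> some fruits: <mid> apple, oranges <mid> also pineapple <start> some animals: <mid> dogs, cats <eos>"

# Group 1: the key, matched lazily so it stops before the first <mid>.
# Group 2: the first value plus any further "<mid> value" runs, up to the
# next tag that is not <mid> (here <start> or <eos>).
pattern = re.compile(r'<start>\s+([^<]+?)\s+<mid>\s+([^<]+(?:<mid>[^<]+)*)')

test_dict = {}
for match in pattern.finditer(test):
    # Group 2 still contains the inner <mid> separators, so split on them
    # and trim the surrounding whitespace from each value.
    test_dict[match.group(1)] = [v.strip() for v in match.group(2).split('<mid>')]

print(test_dict)
```

Note that a repeated key would overwrite the earlier entry here; dict.setdefault or a defaultdict would be needed to merge them.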