Following up from an earlier version of this question asked here.
I have a string of the form --
test = "<bos> <start> some fruits: <mid> apple, oranges <mid> also pineapple <start> some animals: <mid> dogs, cats <eos>"
which needs to be converted to a dictionary (<str>: [list]) of the form:
{"some fruits:": ["apple, oranges", "also pineapple"], "some animals:": ["dogs, cats"]}
Everything between two <mid> tags is a single string, whereas multiple <mid> tags followed by <start> mean different strings.
Currently, my regex (from the post linked above) looks like this:
res = re.finditer(r'<start>\s(\w+)\s<mid>\s(\w+(?:\s<mid>\s\w+)*)', test)
which can then be iterated over to create a dictionary --
test_dict = {}
for match in res:
    test_dict[match.group(1)] = match.group(2).split(' <mid> ')
However, I am unable to capture multiple words between the <start>/<mid>/<mid> tags (i.e. words separated by whitespace, commas, etc.).
How can this regex be changed to capture everything between multiple <> tags?
CodePudding user response:
You could use re.findall:
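import re

data = {}
# Each match is a (tag, text) pair: the tag, then everything up to the next '<'.
for m in re.findall(r'(<\w+>)\s+([^<]+)', test):
    if m[0] == '<start>':
        # A <start> tag opens a new key; keep a reference to its list.
        l = data.setdefault(m[1].strip(), [])
    elif m[0] == '<mid>':
        # A <mid> tag adds a value to the most recent key.
        l.append(m[1].strip())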
Output:
>>> data
{'some fruits:': ['apple, oranges', 'also pineapple'],
'some animals:': ['dogs, cats']}
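The same idea can be wrapped in a function, shown here as a minimal sketch. The name parse_tagged is hypothetical, and the guard against a <mid> tag that appears before any <start> is an assumption about how such input should be treated (it is silently skipped here). Note that <bos> and <eos> never produce a pair with this pattern, because neither is followed by whitespace plus at least one non-'<' character.

```python
import re

def parse_tagged(text):
    """Hypothetical helper: group <mid> values under the preceding <start> key."""
    data = {}
    current = None  # list belonging to the most recent <start> key, if any
    for tag, body in re.findall(r'(<\w+>)\s+([^<]+)', text):
        if tag == '<start>':
            current = data.setdefault(body.strip(), [])
        elif tag == '<mid>' and current is not None:
            # Assumption: a <mid> seen before any <start> is ignored.
            current.append(body.strip())
    return data

test = "<bos> <start> some fruits: <mid> apple, oranges <mid> also pineapple <start> some animals: <mid> dogs, cats <eos>"
print(parse_tagged(test))
# {'some fruits:': ['apple, oranges', 'also pineapple'], 'some animals:': ['dogs, cats']}
```

Keeping the current list in a variable initialised to None avoids a NameError when the first tag in the input is not <start>.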
CodePudding user response:
For a non-regex approach, you can strip out the <bos> and <eos> tags, since you know that you want a dict object, and then use str.split to split on specific tags. Note that I am only using this approach because you don't have end tags to keep track of, like </mid>.
from collections import defaultdict

test = "<bos> <start> some fruits: <mid> apple, oranges <mid> also pineapple <start> some animals: <mid> dogs, cats <eos>"
# Drop the leading "<bos>" and trailing "<eos>" (5 characters each).
test = test[5:-5].strip()
result = defaultdict(list)
for item in test.split('<start>'):
    # The text before the first <start> is empty; skip it.
    if not item:
        continue
    # The first piece is the key, the remaining pieces are its values.
    key, *vals = item.strip().split('<mid>')
    result[key.strip()].extend(val.strip() for val in vals)
print(result)
Result:
defaultdict(<class 'list'>, {'some fruits:': ['apple, oranges', 'also pineapple'], 'some animals:': ['dogs, cats']})
A regex approach might work too, but I haven't tested one yet.
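A regex version in the style of the question's re.finditer loop might look like the following sketch. It is not part of the answer above; the pattern is an assumption that tag contents never contain a '<' character, so [^<]+ can stand in for "everything up to the next tag".

```python
import re

test = "<bos> <start> some fruits: <mid> apple, oranges <mid> also pineapple <start> some animals: <mid> dogs, cats <eos>"

# Group 1: the key after <start> (lazy, so trailing whitespace is left out).
# Group 2: the first <mid> value plus any further "<mid> value" runs.
pattern = re.compile(r'<start>\s*([^<]+?)\s*<mid>\s*([^<]+(?:<mid>[^<]+)*)')

result = {}
for match in pattern.finditer(test):
    # Split the captured run on <mid> and trim the surrounding whitespace.
    result[match.group(1)] = [v.strip() for v in match.group(2).split('<mid>')]

print(result)
# {'some fruits:': ['apple, oranges', 'also pineapple'], 'some animals:': ['dogs, cats']}
```

Replacing \w+ with [^<]+ is what lets each group span several words, commas and colons, which the original pattern could not do.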