I have a text block as below:
group 1
name A
name B
name C
group 2
name X
name Y
name Z
group 3
name I
name II
name III
These are nested groups (group 1, group 2, group 3, ...), and each group contains subgroups (name X, name Y, name Z, ...). How can I use a regex in Python to match both the groups and their subgroups?
I tried the findall pattern below, which matches the outer groups fine, but how can I extend it to match the subgroups as well?
import re
content = """
group 1
name A
name B
name C
group 2
name X
name Y
name Z
group 3
name I
name II
name III
"""
m = re.findall(r"(group \d+).*?name", content, re.DOTALL)
if m:
    for e in m:
        print(e)
output:
: group 1
: group 2
: group 3
expected output:
: group 1
:   name A
:   name B
:   name C
: group 2
:   name X
:   name Y
:   name Z
: group 3
:   name I
:   name II
:   name III
Note: the number of groups, and the number of subgroups in each group, may vary!
CodePudding user response:
To get the expected output, you could insert spaces before the name items.
print(re.sub(r"(?=name)", " ", content).strip())
Or, if you want to capture only the name items that follow each group, use:
(group \d+).+?(.*?)(?=.group|$)
matches = re.findall(r"(group \d+).+?(.*?)(?=.group|$)", content, re.S)
for group, items in matches:
    print(group, end="\n ")
    print(*items.split("\n"), sep="\n ")
The second solution is more flexible: matches is a list of (group, items) tuples, so we can easily convert it to a dict and get the items of only the group we need.
groups = dict(matches)
print(groups.get("group 1"))
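As a possible extension of the dict idea, each group's captured text can be split into a list so that the subgroups are individually addressable. This is only a sketch: it assumes the group number is matched with \d+, that every group has at least one name line, and that the input looks like the sample in the question.

```python
import re

# Sample input copied from the question.
content = """
group 1
name A
name B
name C
group 2
name X
name Y
name Z
group 3
name I
name II
name III
"""

# Capture each group header and the block of lines that follows it.
# Assumption: the group number is one or more digits (\d+).
matches = re.findall(r"(group \d+).+?(.*?)(?=.group|$)", content, re.S)

# Turn each captured block into a list of its "name ..." lines.
groups = {group: items.splitlines() for group, items in matches}

print(groups["group 1"])          # names that belong to group 1
print(groups.get("group 9", []))  # unknown groups fall back to an empty list
```

With this shape, the differing number of subgroups per group needs no special handling, because each list simply has its own length.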
CodePudding user response:
A regex is not required here. You can try the following code:
for i in content.strip().split("\n"):
    print(i if "group" in i else f" {i}")
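If the nesting itself is needed rather than just indented output, the same line-by-line idea can build a dictionary. The following is a rough sketch without regex; the names nested and current are illustrative, and it assumes every group header starts with the word "group".

```python
# Sample input copied from the question.
content = """
group 1
name A
name B
name C
group 2
name X
name Y
name Z
group 3
name I
name II
name III
"""

nested = {}      # maps each group header to the list of its names
current = None   # header of the group currently being filled

for line in content.splitlines():
    line = line.strip()
    if not line:
        continue                      # skip blank lines
    if line.startswith("group"):
        current = line                # a new group begins here
        nested[current] = []
    elif current is not None:
        nested[current].append(line)  # attach the name to the open group

for group, names in nested.items():
    print(group)
    for name in names:
        print("  " + name)
```

Lines that appear before the first group header are ignored in this sketch, which may or may not be the desired behaviour for other inputs.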
CodePudding user response:
A group starts with the word group and ends just before the next group or the end of the input. Use a lookahead in the regex to check whether group or the end of the string comes next.
pattern = r"(group\s\d+.*?)(?=group|\Z)"
Use the re.DOTALL flag so that the . in the pattern also matches newlines.
To indent the subgroups, use the .replace() method or the split() and join() methods:
for match in re.finditer(pattern, content, re.DOTALL):
    print(match[0].strip().replace('\n', '\n '))
or
for match in re.finditer(pattern, content, re.DOTALL):
    print('\n '.join(match[0].strip().split('\n')))
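The lookahead approach can also be combined with a second, inner pattern so that the subgroups are matched by a regex too, instead of by string methods. The sketch below is one way this might look; the named groups header and body and the inner pattern are assumptions made for illustration, and the patterns are anchored to line starts with re.MULTILINE.

```python
import re

# Sample input copied from the question.
content = """
group 1
name A
name B
name C
group 2
name X
name Y
name Z
group 3
name I
name II
name III
"""

# Outer pattern: a header line plus everything up to the next header or the end.
outer = re.compile(
    r"^(?P<header>group\s\d+)\n(?P<body>.*?)(?=^group\s|\Z)",
    re.DOTALL | re.MULTILINE,
)
# Inner pattern: a single subgroup line inside a group's body.
inner = re.compile(r"^name .+$", re.MULTILINE)

result = []
for match in outer.finditer(content):
    names = inner.findall(match["body"])   # all subgroup lines of this group
    result.append((match["header"], names))
    print(match["header"])
    for name in names:
        print("  " + name)
```

Keeping the two patterns separate means the inner one only ever sees a single group's text, so a name line cannot be attributed to the wrong group.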