match nested groups with regex

Time:08-12

I have a text block as below:

group 1
name A
name B
name C
group 2
name X
name Y
name Z
group 3
name I
name II
name III

The text consists of nested groups (group 1, group 2, group 3, ...), and each group contains sub-groups (name X, name Y, name Z, ...). How can I use a regex in Python to match both the groups and their sub-groups?

The findall pattern below matches the outer groups fine, but how do I extend it to match the sub-groups as well?

import re

content = """
group 1
name A
name B
name C
group 2
name X
name Y
name Z
group 3
name I
name II
name III
"""

m = re.findall(r"(group \d+).*?name", content, re.DOTALL)
if m:
    for e in m:
        print(e)

output:

: group 1
: group 2
: group 3

expected output:

: group 1
:    name A
:    name B
:    name C
: group 2
:    name X
:    name Y
:    name Z
: group 3
:    name I
:    name II
:    name III

Note that the number of groups and the number of sub-groups in each group may differ!

CodePudding user response:

To get the expected output, you could insert spaces before each name item.

print(re.sub(r"(?=name)", "   ", content).strip())

Or, if you want to capture only the name items that follow each group, use

(group \d+).+?(.*?)(?=.group|$)

matches = re.findall(r"(group \d+).+?(.*?)(?=.group|$)", content, re.S)
for group, items in matches:
    print(group, end="\n   ")
    print(*items.split("\n"), sep="\n   ")

The second solution is more flexible: matches is a list of (group, items) tuples, so it can easily be converted to a dict to look up the items of only the group you need.

groups = dict(matches)
print(groups.get("group 1"))
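
As a minimal sketch of the same idea, the items string can also be split so that each group maps to a list of names. The helper name parse_groups and the shortened sample text are assumptions made for illustration; the pattern is the one from this answer, and groups with no name lines are not handled.

```python
import re

# Shortened sample with an uneven number of names per group.
content = """
group 1
name A
name B
name C
group 2
name X
name Y
group 3
name I
"""

# Same pattern as in the answer; re.S lets "." match newlines.
PATTERN = re.compile(r"(group \d+).+?(.*?)(?=.group|$)", re.S)

def parse_groups(text):
    """Map each group header to the list of name lines that follow it."""
    return {group: items.split("\n") for group, items in PATTERN.findall(text)}

groups = parse_groups(content)
print(groups.get("group 2"))
```

Because the result is a plain dict of lists, the number of names can differ from group to group without any change to the pattern.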

CodePudding user response:

A regex is not required here. You can try the following code:

# Print group lines as-is and indent every other line.
for line in content.strip().split("\n"):
    print(line if "group" in line else f"    {line}")
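
If the nested structure itself is needed rather than just indented output, the same line-by-line idea can collect the names under each group. This is only a sketch: the function name group_lines and the sample text are hypothetical, and it assumes every group line starts with the word group.

```python
def group_lines(text):
    """Collect the lines following each 'group' line into a dict of lists."""
    groups = {}
    current = None
    for line in text.strip().splitlines():
        line = line.strip()
        if line.startswith("group"):
            # A new group begins; later lines belong to it.
            current = line
            groups[current] = []
        elif line and current is not None:
            groups[current].append(line)
    return groups

sample = "group 1\nname A\nname B\ngroup 2\nname X\n"
for group, names in group_lines(sample).items():
    print(group)
    for name in names:
        print(f"    {name}")
```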

CodePudding user response:

A group starts with the word group and ends just before the next group or the end of the input. Use a lookahead in the regex to check whether group or the end of the string comes next.

pattern = r"(group\s\d+.*?)(?=group|\Z)"

Use the re.DOTALL flag so the . in the pattern will match newlines.

To indent the subgroups, use the .replace() method or the split() and join() methods:

for match in re.finditer(pattern, content, re.DOTALL):
    print(match[0].strip().replace('\n', '\n    '))

or

for match in re.finditer(pattern, content, re.DOTALL):
    print('\n    '.join(match[0].strip().split('\n')))
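
To capture the sub-groups as separate matches instead of only indenting them, one possible approach is a two-level search: an outer pattern for each group and an inner pattern for the names inside it. The sketch below is built on this answer's lookahead idea; the names GROUP_RE, NAME_RE and nested_matches are made up for illustration, and it assumes each name is a single word after "name".

```python
import re

# Outer pattern: group header, then the body up to the next group or the end.
GROUP_RE = re.compile(r"(group\s\d+)(.*?)(?=group|\Z)", re.DOTALL)
# Inner pattern: one "name <word>" item inside a group body.
NAME_RE = re.compile(r"name\s\S+")

def nested_matches(text):
    """Return a list of (group, [names]) tuples."""
    return [(m[1], NAME_RE.findall(m[2])) for m in GROUP_RE.finditer(text)]

sample = "group 1\nname A\nname B\ngroup 2\nname X\n"
for group, names in nested_matches(sample):
    print(group)
    for name in names:
        print("    " + name)
```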