regex: balancing "{}" in a complex regex (python)

Time:11-21

I am trying to extract information from a complex string with a regex. I want to capture everything between the first { and the last } as the content. Unfortunately, I am struggling with the nested {}. How can I deal with this?

I think the key is to balance the {} across the whole regex, but I haven't been successful so far. See this example for parentheses: Regular expression to match balanced parentheses

import re

my_string = """
extend mineral Uraninite {
    kinetics {
        rate = -3.2e-08 mol/m2/s
        area = Uraninite
        y-term, species = Uraninite
        w-term {
            species = H[+]
            power = 0.37
        }
    }
    kinetics {
        rate = 3.2e-09 mol/m2/s
        area = Uraninite
        y-term, species = Uraninite
        w-term {
            species = H[+]
            power = 0.37
        }
    }
}
"""

regex = re.compile(
        r"extend\s+"
        r"(?:(?P<phase>colloid|mineral|basis|isotope|solid-solution)\s+)?"
        r"(?P<species>[^\n ]+)\s+"
        r"{(?P<content>[^}]*)}\n\s+}")
extend_list = [m.groupdict() for m in regex.finditer(my_string)]

So far, I got:

print(extend_list[0]["content"])

"""
    kinetics {
        rate = -3.2e-08 mol/m2/s
        area = Uraninite
        y-term, species = Uraninite
        w-term {
            species = H[+]
            power = 0.37
"""

Apparently, I need to use the third-party regex package, because re does not support recursion. Indeed, this seems to work:

import regex as re
pattern = re.compile(r"{(?P<content>((?:[^{}]|(?R))*))}")
extend_list2 = [m.groupdict() for m in pattern.finditer(my_string)]

print(extend_list2[0]["content"])

"""
kinetics {
        rate = -3.2e-08 mol/m2/s
        area = Uraninite
        y-term, species = Uraninite
        w-term {
            species = H[+]
            power = 0.37
        }
    }
    kinetics {
        rate = 3.2e-09 mol/m2/s
        area = Uraninite
        y-term, species = Uraninite
        w-term {
            species = H[+]
            power = 0.37
        }
    }
"""

But inserting it in the main pattern does not work.

pattern = re.compile(
        r"extend\s+([^n]*)"
        r"(?:(?P<phase>colloid|mineral|basis|isotope|solid-solution)\s+)?"
        r"(?P<species>[^\n ]+)\s+"
        r"{(?P<content>((?:[^{}]|(?R))*))\}")
extend_list = [m.groupdict() for m in pattern.finditer(my_string)]

Answer:

I believe the current regex can be written as

rx = r"extend\s+(.*)(?:(?P<phase>colloid|mineral|basis|isotope|solid-solution)\s+)?(?P<species>\S+)\s+({(?P<content>((?:[^{}]++|(?4))*))})"

The (?R), which recurses the whole pattern, is replaced with a regex subroutine call that recurses only the brace group, ({(?P<content>((?:[^{}]++|(?4))*))}). That group is Group 4, so the subroutine call is (?4).
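As a minimal sketch of the subroutine idea, assuming the third-party regex package is installed (pip install regex): the group name block is my own choice for illustration, and (?&block) is the named equivalent of a numbered call such as (?4). This variant also leaves out the leading (.*) group, since a greedy .* placed before the optional phase group can swallow the text that phase and species are meant to capture.

```python
import regex  # third-party package, not the standard-library re

# Shortened version of the input from the question.
text = """
extend mineral Uraninite {
    kinetics {
        rate = -3.2e-08 mol/m2/s
        area = Uraninite
        w-term {
            power = 0.37
        }
    }
}
"""

pattern = regex.compile(
    r"extend\s+"
    r"(?:(?P<phase>colloid|mineral|basis|isotope|solid-solution)\s+)?"
    r"(?P<species>\S+)\s+"
    # 'block' is one balanced {...} group; (?&block) re-enters only this
    # group, not the whole pattern, so nested braces need no 'extend' prefix.
    r"(?P<block>\{(?P<content>(?:[^{}]++|(?&block))*)\})"
)

match = pattern.search(text)
result = match.groupdict()
print(result["phase"], result["species"])
print(result["content"])
```

The content group holds everything between the outermost braces, nested blocks included.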

The [^n]* looks like a typo: it matches zero or more characters other than the letter n. I used .* instead, which matches as many characters as possible other than line break characters.

The [^\n ]+ looks like an attempt to match non-whitespace chunks, so I suggest \S+ here.
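If adding the regex dependency is not an option, a different technique also works with the standard library alone: let re match only the header up to the opening brace, then walk the text with a depth counter to find the matching closing brace. This is a sketch rather than a drop-in replacement; the function name find_extend_blocks and the second sample block are hypothetical, and it assumes braces never appear inside values.

```python
import re

# Hypothetical sample in the same format as the question's input.
SAMPLE = """
extend mineral Uraninite {
    kinetics {
        rate = -3.2e-08 mol/m2/s
        w-term {
            power = 0.37
        }
    }
}
extend Quartz {
    surface = 1 m2/g
}
"""

# Matches only the header, up to and including the opening brace.
HEADER = re.compile(
    r"extend\s+"
    r"(?:(?P<phase>colloid|mineral|basis|isotope|solid-solution)\s+)?"
    r"(?P<species>\S+)\s+\{"
)


def find_extend_blocks(text):
    """Return one dict per 'extend' block, with balanced-brace content."""
    blocks = []
    pos = 0
    while True:
        header = HEADER.search(text, pos)
        if header is None:
            break
        depth = 1  # the header already consumed the opening brace
        i = header.end()
        while i < len(text) and depth:
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
            i += 1
        if depth:  # ran out of text before the block was closed
            break
        info = header.groupdict()
        info["content"] = text[header.end():i - 1]  # drop the final brace
        blocks.append(info)
        pos = i  # continue searching after this block
    return blocks


blocks = find_extend_blocks(SAMPLE)
for block in blocks:
    print(block["phase"], block["species"])
```

The counter does the balancing that the recursive pattern does, at the cost of a few more lines of code.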
