I am trying to extract information from a complex string with a regex. I want to capture everything between the first { and the last } as the content. Unfortunately, I struggle with nested {}. How can I deal with this?
I think the key is to balance the {} across the whole regex, but I haven't been successful so far. See the example below for parentheses:
Regular expression to match balanced parentheses
import re
my_string = """
extend mineral Uraninite {
    kinetics {
        rate = -3.2e-08 mol/m2/s
        area = Uraninite
        y-term, species = Uraninite
        w-term {
            species = H[+]
            power = 0.37
        }
    }
    kinetics {
        rate = 3.2e-09 mol/m2/s
        area = Uraninite
        y-term, species = Uraninite
        w-term {
            species = H[+]
            power = 0.37
        }
    }
}
"""
regex = re.compile(
    r"extend\s+"
    r"(?:(?P<phase>colloid|mineral|basis|isotope|solid-solution)\s+)?"
    r"(?P<species>[^\n ]+)\s+"
    r"{(?P<content>[^}]*)}\n\s+}")
extend_list = [m.groupdict() for m in regex.finditer(my_string)]
So far, I get:
print(extend_list[0]["content"])
"""
    kinetics {
        rate = -3.2e-08 mol/m2/s
        area = Uraninite
        y-term, species = Uraninite
        w-term {
            species = H[+]
            power = 0.37
"""
Apparently, I need to use the third-party regex package, because re does not support recursion. Indeed, this seems to work:
import regex as re
pattern = re.compile(r"{(?P<content>((?:[^{}]|(?R))*))}")
extend_list2 = [m.groupdict() for m in pattern.finditer(my_string)]
print(extend_list2[0]["content"])
"""
    kinetics {
        rate = -3.2e-08 mol/m2/s
        area = Uraninite
        y-term, species = Uraninite
        w-term {
            species = H[+]
            power = 0.37
        }
    }
    kinetics {
        rate = 3.2e-09 mol/m2/s
        area = Uraninite
        y-term, species = Uraninite
        w-term {
            species = H[+]
            power = 0.37
        }
    }
"""
But inserting it into the main pattern does not work:
pattern = re.compile(
    r"extend\s+([^n]*)"
    r"(?:(?P<phase>colloid|mineral|basis|isotope|solid-solution)\s+)?"
    r"(?P<species>[^\n ]+)\s+"
    r"{(?P<content>((?:[^{}]|(?R))*))\}")
extend_list = [m.groupdict() for m in pattern.finditer(my_string)]
Answer:
I believe the current regex can be written as
rx = r"extend\s+(.*)(?:(?P<phase>colloid|mineral|basis|isotope|solid-solution)\s+)?(?P<species>\S+)\s+({(?P<content>((?:[^{}]+|(?4))*))})"
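The (?R) construct recurses the whole pattern, including the extend header, so nested braces can no longer match once the header is part of the regex. It is therefore changed into a regex subroutine call: the braces part is wrapped in its own capturing group, ({(?P<content>((?:[^{}]+|(?4))*))}). That group is Group 4, so the subroutine call is (?4).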
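One way to double-check the group number is to count the groups on a stand-in pattern. The sketch below uses the standard re module only, so the (?4) call is left out (re cannot compile it); the name skeleton and the idea of reading the number off groupindex are illustrative assumptions, not part of the original answer.

```python
import re

# Non-recursive stand-in for the pattern above: the same capturing groups in
# the same order, but without the (?4) call, which stdlib re cannot compile.
skeleton = re.compile(
    r"extend\s+(.*)"
    r"(?:(?P<phase>colloid|mineral|basis|isotope|solid-solution)\s+)?"
    r"(?P<species>\S+)\s+"
    r"(\{(?P<content>((?:[^{}]+)*))\})"
)

# Map of named groups to their numbers.
named = dict(skeleton.groupindex)

# The braces group opens immediately before the "content" group,
# so its number is one less than the number of "content".
brace_group = named["content"] - 1

print(named)
print("braces group:", brace_group, "of", skeleton.groups, "groups")
```

If a group is added or removed in front of the braces part, the number changes and the (?4) call has to be updated accordingly.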
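The [^n]* looks like a typo: it matches zero or more characters other than the letter n. I used .* instead, which matches zero or more characters other than line break characters, as many as possible. Note that, being greedy, .* can swallow most of the header (on the sample input it leaves only the final e of Uraninite for species); if that capture is not needed, dropping it or using the lazy .*? avoids this.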
The [^\n ]+ looks like an attempt to match chunks of non-whitespace characters, so I suggest \S+ here.
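If installing the regex package is not an option, a different technique can stand in for recursion: match only the header with re, then count brace depth by hand. This is a minimal sketch under that assumption, not the approach used in the answer above; HEADER, extract_blocks and the shortened sample are hypothetical names and data made up for illustration.

```python
import re

# Header only: "extend", optional phase, species, then the opening brace.
HEADER = re.compile(
    r"extend\s+"
    r"(?:(?P<phase>colloid|mineral|basis|isotope|solid-solution)\s+)?"
    r"(?P<species>\S+)\s+\{"
)

def extract_blocks(text):
    """Return a list of dicts with phase, species and balanced-brace content."""
    results = []
    pos = 0
    while True:
        m = HEADER.search(text, pos)
        if m is None:
            break
        depth = 1          # the header already consumed the opening brace
        i = m.end()
        while i < len(text) and depth > 0:
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
            i += 1
        if depth != 0:
            break          # unbalanced braces: stop rather than guess
        # text[i - 1] is the closing brace of the outer block; exclude it.
        results.append({**m.groupdict(), "content": text[m.end():i - 1]})
        pos = i
    return results

# Shortened sample in the same shape as the input from the question.
sample = (
    "extend mineral Uraninite {\n"
    "  kinetics {\n"
    "    w-term {\n"
    "      power = 0.37\n"
    "    }\n"
    "  }\n"
    "}\n"
)
blocks = extract_blocks(sample)
print(blocks[0]["phase"], blocks[0]["species"])
print(blocks[0]["content"])
```

The trade-off is that this counts every brace, so a brace inside a quoted value or comment would throw the depth off; the recursive regex has the same limitation.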