Python Regex to find everything within parenthesis, with a prefix beforehand

Time:11-09

This seems like a fairly simple issue, but I can't get it to work.

I have a text file that contains JSON-like data, but a couple of additional lines stop it from being valid JSON, and I need to remove them. This should be simple, all the more so because the valid JSON strings (which I can parse later) are always wrapped in the following container:

xyz()

So for example, the dataset will be something like:

abcdefg
xyz({"id_value": 123, "text_value": "efg"})

abcdefg
xyz({"id_value": 124, "text_value": "hij"})

Each separate JSON string is always prefixed by abcdefg and then xyz( and there is always a closing bracket after. So the format is consistent.

I was trying the following:

re.findall(r'xyz\(.*?\)', text_file)

However, despite attempting variations of this (e.g. using re.search, trying \w, etc.), nothing seems to work (by which I mean it returns an empty list).

If I just try to do the following:

re.findall(r'xyz\(', text_file)

Then it returns:

['xyz(', 'xyz(']

As expected.

So the issue appears to be with the string inside the brackets, but I cannot work out what the problem is, as other examples on here suggest my code is correct (which it can't be, as it doesn't work)!

I presume it's something horrifically simple, but I'm a bit stuck!

CodePudding user response:

You can install the PyPI regex module by running pip install regex (or pip3 install regex), and then use it to match everything between xyz( and its paired closing ) char:

import regex
#...
output = [x.group() for x in regex.finditer(r'xyz(\((?:[^()]++|(?1))*\))', text_file)]

The list comprehension avoids an issue with regex.findall: when the pattern defines a capturing group, findall returns only the captured substrings. Here the capturing group around the parentheses is required, because the pattern recurses into it with the (?1) subroutine call.

Pattern details:

  • xyz - xyz text
  • (\((?:[^()]++|(?1))*\)) - Group 1:
    • \( - a ( char
    • (?:[^()]++|(?1))* - zero or more repetitions of either one or more chars other than ( and ) (matched possessively), or a subroutine call that recurses the whole Group 1 pattern
    • \) - a ) char.
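
If installing a third-party module is not an option, the same balanced-parenthesis idea can be sketched with the standard library only. This is not the regex-module solution above but a hand-written stand-in for it: the function name find_prefixed_groups and the third sample record (with parentheses inside the text) are made up for illustration, and the sketch assumes that any parentheses inside the JSON strings are balanced, which is the same limitation the recursive pattern has.

```python
import json
import re


def find_prefixed_groups(text, prefix="xyz"):
    """Return every 'prefix(...)' substring whose parentheses are balanced.

    Hypothetical helper: it mimics the recursive pattern by counting
    nesting depth instead of recursing.
    """
    opener = re.compile(re.escape(prefix) + r"\(")
    results = []
    pos = 0
    while True:
        m = opener.search(text, pos)
        if m is None:
            break
        depth = 0
        end = None
        # Start scanning at the opening parenthesis that follows the prefix.
        for i in range(m.end() - 1, len(text)):
            ch = text[i]
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth == 0:
                    end = i + 1  # one past the paired closing parenthesis
                    break
        if end is None:
            # No paired closing parenthesis: skip this opener and keep looking.
            pos = m.end()
            continue
        results.append(text[m.start():end])
        pos = end
    return results


# First two records come from the question; the third is an assumed example
# with nested parentheses and a line break inside the container.
sample = (
    'abcdefg\n'
    'xyz({"id_value": 123, "text_value": "efg"})\n'
    '\n'
    'abcdefg\n'
    'xyz({"id_value": 124, "text_value": "hij"})\n'
    '\n'
    'abcdefg\n'
    'xyz({"id_value": 125,\n "text_value": "k(l)m"})\n'
)

matches = find_prefixed_groups(sample)
# Drop the leading 'xyz(' and the trailing ')' before parsing the JSON.
records = [json.loads(m[len("xyz("):-1]) for m in matches]
print(records)
```

Because the scan walks the text character by character, line breaks inside the container do not matter, and each match can be handed to json.loads once the wrapper is stripped.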