This seems like a fairly simple issue, but I can't get it to work.
I have a text file that contains JSON-like data, but a couple of additional lines stop it from being valid JSON, and I need to remove them. This sounds very simple, even more so because the valid JSON strings (which I can parse later) are always wrapped in the following container:
xyz()
So for example, the dataset will be something like:
abcdefg
xyz({"id_value": 123, "text_value": "efg"})
abcdefg
xyz({"id_value": 124, "text_value": "hij"})
Each separate JSON string is always prefixed by abcdefg and then xyz( and there is always a closing bracket after. So the format is consistent.
I was trying the following:
re.findall(r'xyz\(.*?\)', text_file)
However, despite attempting variations of this (e.g. using re.search, trying \w, etc.), nothing seems to work (by which I mean it returns an empty list).
If I just try to do the following:
re.findall(r'xyz\(', text_file)
Then it returns:
['xyz(', 'xyz(']
As expected.
So the issue appears to be with the string inside the brackets, but I cannot work out what the problem is, as other examples on here suggest my code is correct (which it can't be, as it doesn't work)!
I presume it's something horrifically simple, but I'm a bit stuck!
CodePudding user response:
You can install the PyPI regex module by running pip install regex (or pip3 install regex) and then use this library to match strings between xyz( and the next paired ) char using:
import regex
#...
output = [x.group() for x in regex.finditer(r'xyz(\((?:[^()]+|(?1))*\))', text_file)]
The list comprehension with regex.finditer is used because regex.findall returns only the captured substrings when the pattern defines a capturing group. Here, the capturing group around the parentheses is required, since it is recursed inside the pattern with the (?1) subroutine call.
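The findall behaviour described above is the same in the standard re module, so it can be checked without installing anything. This is only a small sketch: the sample string is made up from the question's data, and the pattern is a simplified non-recursive one used purely to show the difference between the two calls.

```python
import re

# One record in the shape shown in the question (illustrative sample).
sample = 'abcdefg\nxyz({"id_value": 123, "text_value": "efg"})\n'

# Simplified pattern with one capturing group around the parentheses.
pattern = r'xyz(\(.*?\))'

# findall returns only what the capturing group matched.
groups_only = re.findall(pattern, sample)

# finditer yields match objects, so .group() gives the whole match.
whole_matches = [m.group() for m in re.finditer(pattern, sample)]

print(groups_only)
print(whole_matches)
```

The first list holds only the parenthesised part, while the second keeps the xyz prefix as well, which is why the answer's code goes through finditer.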
Pattern details:
xyz - the literal xyz text
(\((?:[^()]+|(?1))*\)) - Group 1:
    \( - a ( char
    (?:[^()]+|(?1))* - zero or more repetitions of either one or more chars other than ( and ), or the subroutine that repeats (recurses) the whole Group 1 pattern
    \) - a ) char.
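If installing a third-party module is not an option, a different technique may also work: rather than matching balanced parentheses, let the standard json module decide where each object ends. The sketch below is an alternative to the regex approach, not part of it. It assumes that valid JSON starts immediately after xyz(, and the function name extract_payloads and the sample text (with parentheses added inside one string value) are made up for illustration.

```python
import json


def extract_payloads(text, prefix="xyz("):
    """Return the JSON values that directly follow each prefix in text."""
    decoder = json.JSONDecoder()
    payloads = []
    pos = text.find(prefix)
    while pos != -1:
        start = pos + len(prefix)
        try:
            # raw_decode parses one JSON value beginning at index `start`
            # and returns it together with the index where it ended.
            obj, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            # No valid JSON right after this prefix: skip the occurrence.
            end = start
        else:
            payloads.append(obj)
        pos = text.find(prefix, end)
    return payloads


# Hypothetical sample; the second record has parentheses inside a string.
sample = (
    'abcdefg\n'
    'xyz({"id_value": 123, "text_value": "efg"})\n'
    'abcdefg\n'
    'xyz({"id_value": 124, "text_value": "h(i)j"})\n'
)

records = extract_payloads(sample)
print(records)
```

Because the JSON decoder tracks strings and nesting itself, brackets inside string values should not confuse it, and the result is already parsed into Python dicts.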