How to get all repeating groups in regex expression (python)

I have a fixed width piece of data which can be repeated any number of times, embedded in XML.

Each piece of data has the format:

2234599999920210930

So 5 digits, followed by 6 digits, followed by 8 digits.

So it could also be:

11111222222333333334444455555566666666

where two sets of fixed-width data have been entered.

The data always resides between the same tags:

<content>12345999999202109302234599999920210930</content>

I want to get each group of data out, with the subgroups named id1, id2 and id3. So for my second example I would have:

id1 : 11111
id2 : 222222
id3 : 33333333

id1 : 44444
id2 : 555555
id3 : 66666666

I have tried:

<content.*>((?P<id1>[a-zA-Z0-9]{5})(?P<id2>[a-zA-Z0-9]{6})(?P<id3>[a-zA-Z0-9]{8}))+<\/content>

This finds the data between the content tags, but only ever gives me the last set of groups.

An example is at

https://regex101.com/r/We49Sc/1

Any help really appreciated

CodePudding user response：

Obligatory don't use regex with HTML.

That aside, if you do use an XML/HTML parser to get the contents of the <content> tags and you can guarantee that you only have fixed-length strings, you're probably better off just splitting the contents into the fixed lengths your format expects.

>>> x = "qwertyui"
>>> chunks, chunk_size = len(x), len(x) // 4
>>> [x[i:i + chunk_size] for i in range(0, chunks, chunk_size)]
['qw', 'er', 'ty', 'ui']

Split string into fixed lengths - Alexander
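
Applied to the format in the question, the same slicing idea might look like the sketch below. It assumes the payload sits in a <content> element that is a direct child of the root; the <root> wrapper, the FIELDS table and the parse_records helper are made up for illustration and are not from the question.

```python
import xml.etree.ElementTree as ET

# Field names and widths from the question: 5, 6 and 8 characters per record.
FIELDS = (("id1", 5), ("id2", 6), ("id3", 8))
RECORD_WIDTH = sum(width for _, width in FIELDS)  # 19 characters per record


def parse_records(payload):
    """Split a fixed-width payload into a list of {id1, id2, id3} dicts."""
    if len(payload) % RECORD_WIDTH:
        raise ValueError("payload length is not a multiple of the record width")
    records = []
    for start in range(0, len(payload), RECORD_WIDTH):
        record, offset = {}, start
        for name, width in FIELDS:
            record[name] = payload[offset:offset + width]
            offset += width
        records.append(record)
    return records


# Hypothetical wrapper document; only the <content> element is from the question.
xml_text = "<root><content>12345999999202109302234599999920210930</content></root>"
root = ET.fromstring(xml_text)
records = parse_records(root.findtext("content").strip())
for record in records:
    print(record)
```

The length check is there so that a truncated record raises an error instead of silently producing a short id3.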

CodePudding user response：

You really should be using an XML parser. But I present the following for any instructive value it might have for those situations where using an XML parser would not be applicable:

If you use the regex package, installable from PyPI, you get additional regex capabilities (such as \G and \K) that are otherwise found in the regex engines that come with PHP and Perl.

See Regex Demo

import regex as re

# The inline (?x) flag (same as re.X / re.VERBOSE) makes whitespace within
# the pattern insignificant and lets you add comments:
pattern = r"""(?x)
(?:
    <content[^>]*>      # Matches '<content ...>'
    |                   # OR
    \G                  # Either the start of the search (start of string) or the end of the last match
    (?!\A)              # But not the start of string, so only the end of the last match
)
\K                      # Don't include anything prior to this point in the match
(?P<id1>[0-9]{5})
(?P<id2>[0-9]{6})
(?P<id3>[0-9]{8})
(?=.*</content>)        # followed eventually by </content>
"""

s = '<content>12345999999202109302234599999920210930</content>'

for m in re.finditer(pattern, s):
    print(m['id1'], m['id2'], m['id3'])
# or
print(re.findall(pattern, s))

Prints:

12345 999999 20210930
22345 999999 20210930
[('12345', '999999', '20210930'), ('22345', '999999', '20210930')]

CodePudding user response：

Another way:

import re

text = "<content>12345999999202109302234599999920210930</content>"
regex = re.compile(r"(?P<id1>\d{5})(?P<id2>\d{6})(?P<id3>\d{8})(?=(?:\d{5}\d{6}\d{8})*<\/content>)")

for match in regex.finditer(text):
  print(match.groupdict())

# {'id1': '12345', 'id2': '999999', 'id3': '20210930'}
# {'id1': '22345', 'id2': '999999', 'id3': '20210930'}

# As a list: print([match.groupdict() for match in regex.finditer(text)])
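
If the lookahead is hard to follow, a two-step variant with the standard re module may be easier to read: first capture everything between the tags, then run the record pattern over that captured text only. This is a rough sketch; the names content_re and record_re are illustrative.

```python
import re

text = "<content>12345999999202109302234599999920210930</content>"

# Step 1: capture everything between the tags (attributes on the tag are allowed).
content_re = re.compile(r"<content\b[^>]*>(.*?)</content>", re.DOTALL)
# Step 2: one fixed-width record; finditer applies it repeatedly.
record_re = re.compile(r"(?P<id1>\d{5})(?P<id2>\d{6})(?P<id3>\d{8})")

records = [
    match.groupdict()
    for block in content_re.finditer(text)
    for match in record_re.finditer(block.group(1))
]
print(records)
```

Unlike the lookahead version, this does not check that the content consists only of complete records, so stray trailing digits would be ignored without any error.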