How should I make these regex capture groups more succinct?

Time:10-30

I'm using Python's re library to do this, but it's a basic regex question.

I am receiving a string of coordinate information in degrees-minutes-seconds format without spaces, and I'm parsing it out to discrete coordinate pairs for conversion.

The string is fed to me looking like this (fake coords for example):

102030N0102030E203040N0203040E304050N0304050E405060N0405060E

I am catching it like this:

import re

coordstr = '102030N0102030E203040N0203040E304050N0304050E405060N0405060E'

coords = re.match(
    re.compile(r"^(\d+[NS]{1}\d+[EW]{1})(\d+[NS]{1}\d+[EW]{1})(\d+[NS]{1}\d+[EW]{1})(\d+[NS]{1}\d+[EW]{1})"),
    coordstr)

for x in coords.groups():
    print(x)

which gives me

102030N0102030E
203040N0203040E
304050N0304050E
405060N0405060E

And allows me to address each coordinate pair as coords.group(1), coords.group(2) and so on.

So it works, but it feels like I'm being too verbose in the pattern. Is there a more succinct way to crawl the line with one of the capture groups, and add each matched group to .groups() as it's encountered? I know I could do it with brute force string slicing but that seems like more trouble than it's worth.

I've read this but it doesn't seem to address what I'm going after in this question.

Because this is for an enterprise and these strings describe raster bounds, I will be validating the string before introducing the regex search and falling back to a gdal object if the string is not found (or corrupted).

CodePudding user response:

Since you pre-validate the strings before applying the regex, you don't need re.search / re.match with several groups that repeat the same pattern. Use re.findall to collect every match of \d+[NS]\d+[EW] in the string:

import re
coordstr = '102030N0102030E203040N0203040E304050N0304050E405060N0405060E'
coords = re.findall(r'\d+[NS]\d+[EW]', coordstr)
for x in coords:
    print(x)

Output:

102030N0102030E
203040N0203040E
304050N0304050E
405060N0405060E
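
If the latitude and longitude halves are needed separately for the conversion step, one option is to put capture groups inside the findall pattern. With more than one group, re.findall returns a list of tuples, one tuple per match. This is only a sketch; the variable names are illustrative and not part of the original code.

```python
import re

coordstr = '102030N0102030E203040N0203040E304050N0304050E405060N0405060E'

# Two capture groups: re.findall returns a (lat, lon) tuple for each match.
pairs = re.findall(r'(\d+[NS])(\d+[EW])', coordstr)

for lat, lon in pairs:
    print(lat, lon)
```

Each tuple can then be handed to whatever conversion routine you already have.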


NOTE: the list of matches returned by re.findall will always be in the same order as they are in the source text, see this SO post.
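
If the pre-validation step is itself regex-based, a rough sketch could check the whole string with re.fullmatch before calling re.findall, so that malformed input is rejected rather than partially matched. The names PAIR and parse_pairs are hypothetical, and returning None to signal the fallback path is an assumption about how the caller would be wired up.

```python
import re

# Hypothetical helper names; the pair pattern is the one used above.
PAIR = r"\d+[NS]\d+[EW]"


def parse_pairs(coordstr):
    """Return the list of coordinate pairs, or None if the string is malformed."""
    # fullmatch requires the entire string to consist of one or more pairs.
    if re.fullmatch(rf"(?:{PAIR})+", coordstr) is None:
        return None  # the caller can fall back to another source here
    return re.findall(PAIR, coordstr)


pairs = parse_pairs('102030N0102030E203040N0203040E304050N0304050E405060N0405060E')
bad = parse_pairs('102030N0102030E20304')  # truncated input
```

Here pairs holds the four coordinate strings and bad is None, because the truncated input does not end in a complete pair.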
