Regex: get the result in one group when the pattern has several alternatives

I am parsing a string with bundle configurations. For simplicity's sake, let's say there are only two layouts:

WEIGHT*SIZE
and
SIZE*WEIGHT

Sample data looks like this:

12g*15
13g*20
20pack*2.5kg
40packs*15g
10p*35g

The regex I am using now is basically two expressions, one per layout, separated by '|':

(?:[0-9]{1,3})(?:packs|pack|p|)\*([\d.,]{1,3})(?:g|kg)|([\d.,]{1,3})(?:g|kg)\*(?:[0-9]{1,3})(?:packs|pack|p|)

But for the first two lines it gives the result in group(2), and for the last three the result is in group(1).

For simplicity's sake, can I somehow have the result only in group(1), regardless of which side of the "|" in my regex matched, so that I don't need to iterate through the groups after using re.search?

(I know I could just use ([\d.]{1,3})(?:g|kg), but I need to fetch the weight from exactly these types of layouts; a single weight without a bundle size, like 5kg, should not be taken into account.)

CodePudding user response：

You didn't specify which language or regex flavour you are using, but assuming it is Python, here are a few possible solutions:

Option 1: Select first non-empty capturing group

This is the solution proposed in the comment from Tim Biegeleisen, and probably the quickest. It might look something like this:

import re
# Raw string, so that \* and \d reach the regex engine unchanged
pattern = r'(?:[0-9]{1,3})(?:packs|pack|p|)\*([\d.,]{1,3})(?:g|kg)|([\d.,]{1,3})(?:g|kg)\*(?:[0-9]{1,3})(?:packs|pack|p|)'
examples = "12g*15,13g*20,20pack*2.5kg,40packs*15g,10p*35g".split(',')

rx = re.compile(pattern)
for e in examples:
    for match in rx.finditer(e):
        # Only one of the two groups takes part in any given match
        for g in match.groups():
            if g is not None:
                print(g)

Output:

12
13
2.5
15
35
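If you would rather not loop over the groups at all, the match object's lastindex attribute may be enough here. The sketch below assumes the same two-branch pattern, where exactly one group takes part in any match; extract_weight is a hypothetical helper name, not something from the question.

```python
import re

# Same two-branch pattern as above, split over two raw strings for readability.
PATTERN = re.compile(
    r'(?:[0-9]{1,3})(?:packs|pack|p|)\*([\d.,]{1,3})(?:g|kg)'
    r'|([\d.,]{1,3})(?:g|kg)\*(?:[0-9]{1,3})(?:packs|pack|p|)'
)

def extract_weight(text):
    """Return the captured weight as a string, or None if nothing matches."""
    m = PATTERN.search(text)
    if m is None:
        return None
    # Exactly one of the two groups participates in a match, so
    # m.lastindex is the number of the group that holds the weight.
    return m.group(m.lastindex)

samples = ["12g*15", "13g*20", "20pack*2.5kg", "40packs*15g", "10p*35g", "5kg"]
weights = [extract_weight(s) for s in samples]
print(weights)
```

The bare 5kg sample comes back as None, because neither branch matches a weight without a bundle size.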

Option 2: Use named capturing groups

The syntax is (?P<name>regex) as per this page. Some regex flavours allow the same name for two capturing groups, so you could modify your regex to the following:

(?:[0-9]{1,3})(?:packs|pack|p|)\*(?P<weight>[\d.,]{1,3})(?:g|kg)|(?P<weight>[\d.,]{1,3})(?:g|kg)\*(?:[0-9]{1,3})(?:packs|pack|p|)

However, Python's built-in re module does not support identically named groups (as per this answer), so you would need to pip install and import the PyPI regex module. It might look like this:

import regex
pattern = r'(?:[0-9]{1,3})(?:packs|pack|p|)\*(?P<weight>[\d.,]{1,3})(?:g|kg)|(?P<weight>[\d.,]{1,3})(?:g|kg)\*(?:[0-9]{1,3})(?:packs|pack|p|)'
examples = "12g*15,13g*20,20pack*2.5kg,40packs*15g,10p*35g".split(',')

rx = regex.compile(pattern)
for e in examples:
    for x in rx.finditer(e):
        print(x.group("weight"))

Output:

12
13
2.5
15
35
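If installing a third-party module is not an option, a rough equivalent with the standard re module is to give each branch its own name and pick whichever one is set. This is only a sketch: weight_a, weight_b and named_weight are illustrative names chosen here, not anything from the question.

```python
import re

# Distinct names per branch, since the standard re module rejects duplicate names.
pattern = re.compile(
    r'(?:[0-9]{1,3})(?:packs|pack|p|)\*(?P<weight_a>[\d.,]{1,3})(?:g|kg)'
    r'|(?P<weight_b>[\d.,]{1,3})(?:g|kg)\*(?:[0-9]{1,3})(?:packs|pack|p|)'
)

def named_weight(text):
    """Return the weight from whichever named group matched, else None."""
    m = pattern.search(text)
    if m is None:
        return None
    groups = m.groupdict()
    # Only the branch that matched has a non-None value.
    if groups['weight_a'] is not None:
        return groups['weight_a']
    return groups['weight_b']

examples = "12g*15,13g*20,20pack*2.5kg,40packs*15g,10p*35g".split(',')
named_results = [named_weight(e) for e in examples]
print(named_results)
```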

Option 3: Rewrite the RegEx so that you can put both options inside a single capturing group

You could make the parts before and after the weight optional so that you just have a single instance of the weight group:

(?:(?:[0-9]{1,3})(?:packs|pack|p|)\*)?([\d.,]{1,3})(?:g|kg)(?:\*(?:[0-9]{1,3})(?:packs|pack|p|))?

The above RegEx captures the number for the weight only in Group 1 for all of your examples. However, it will also capture weights for a string like 40packs*15g*40packs that doesn't match your initial spec, and because both surrounding parts are optional it also matches a bare weight such as 5kg, which you explicitly wanted to exclude. You should be able to rewrite it to be more strict while still keeping only a single capturing group, but it might end up getting quite long.
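One possible way to tighten Option 3 with the standard re module is to put a lookahead in front that requires one of the two full layouts, and then capture the weight once. This is a sketch rather than a definitive pattern; STRICT and the sample list are names made up for the illustration.

```python
import re

STRICT = re.compile(
    # Lookahead: one of the two complete layouts must start here.
    r'(?=\d{1,3}(?:packs|pack|p)?\*[\d.,]{1,3}(?:kg|g)'  # SIZE*WEIGHT
    r'|[\d.,]{1,3}(?:kg|g)\*\d{1,3})'                     # WEIGHT*SIZE
    # Actual match: optional "count*" prefix, then the weight.
    r'(?:\d{1,3}(?:packs|pack|p)?\*)?'
    r'([\d.,]{1,3})(?:kg|g)'                              # the only capturing group
)

samples = ["12g*15", "13g*20", "20pack*2.5kg", "40packs*15g", "10p*35g", "5kg"]
strict_weights = []
for s in samples:
    m = STRICT.search(s)
    # group(1) always holds the weight when there is a match
    strict_weights.append(m.group(1) if m else None)
print(strict_weights)
```

With this version the bare 5kg is rejected, since the lookahead needs a bundle size next to the weight. A string like 40packs*15g*40packs is still accepted, because its first part is a valid layout on its own; rejecting it would need extra boundary checks that this sketch leaves out.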