Regex have result in one group when pattern have several options inside

Time:08-19

I am parsing a string with bundle configurations. For simplicity's sake, let's say there are only two layouts:

WEIGHT*SIZE
and
SIZE*WEIGHT

Sample data looks like this:

12g*15
13g*20
20pack*2.5kg
40packs*15g
10p*35g

The regex I am using now is basically two expressions, one for each layout, separated by '|':

(?:[0-9]{1,3})(?:packs|pack|p|)\*([\d.,]{1,3})(?:g|kg)|([\d.,]{1,3})(?:g|kg)\*(?:[0-9]{1,3})(?:packs|pack|p|)

But for the first two lines the result ends up in group(2), and for the last three it is in group(1).

For simplicity's sake, can I somehow get the result only in group(1), regardless of which side of the "|" in my regex matched, so that I don't need to iterate through the groups after using re.search?

(I know I could just use ([\d.]{1,3})(?:g|kg), but I need to fetch the weight from exactly these types of layouts; a single weight without a bundle size, like 5kg, should not be taken into account.)

CodePudding user response:

You didn't specify which language/flavour of RegEx you are using, but assuming you are using Python, here are a few possible solutions:

Option 1: Select first non-empty capturing group

This is the solution proposed in the comment from Tim Biegeleisen, and probably the quickest. It might look something like this:

import re

# Raw string so that \* and \d reach the regex engine unchanged
pattern = r'(?:[0-9]{1,3})(?:packs|pack|p|)\*([\d.,]{1,3})(?:g|kg)|([\d.,]{1,3})(?:g|kg)\*(?:[0-9]{1,3})(?:packs|pack|p|)'
examples = "12g*15,13g*20,20pack*2.5kg,40packs*15g,10p*35g".split(',')

rx = re.compile(pattern)
for e in examples:
    for match in rx.finditer(e):
        # Only one of the two groups takes part in a match; print that one
        for g in match.groups():
            if g:
                print(g)

Output:

12
13
2.5
15
35
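
If the inner loop over the groups feels noisy, it can be folded into a small helper. The following is only a sketch: extract_weight is a name made up for this example, and it relies on match.lastindex, which holds the index of the last capturing group that took part in the match (here always 1 or 2, since only one side of the "|" can match).

```python
import re

# Same two-layout pattern as above, split over two lines for readability.
PATTERN = re.compile(
    r'[0-9]{1,3}(?:packs|pack|p|)\*([\d.,]{1,3})(?:g|kg)'
    r'|([\d.,]{1,3})(?:g|kg)\*[0-9]{1,3}(?:packs|pack|p|)'
)

def extract_weight(text):
    """Return the captured weight as a string, or None if no layout matches.

    Hypothetical helper, not part of the original answer.
    """
    m = PATTERN.search(text)
    if m is None:
        return None
    # lastindex is the number of the group that actually matched (1 or 2).
    return m.group(m.lastindex)

for sample in "12g*15,13g*20,20pack*2.5kg,40packs*15g,10p*35g".split(','):
    print(extract_weight(sample))
```

The caller still gets a single value per string and never has to know which side of the alternation fired.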

Option 2: Use named capturing groups

The syntax is (?P<name>regex) as per this page. Some regex engines allow the same name to be used for two capturing groups, so you could modify your RegEx to the following:

(?:[0-9]{1,3})(?:packs|pack|p|)\*(?P<weight>[\d.,]{1,3})(?:g|kg)|(?P<weight>[\d.,]{1,3})(?:g|kg)\*(?:[0-9]{1,3})(?:packs|pack|p|)

However, Python's built-in re module does not support identically named groups (as per this answer), so you would need to pip install and import the PyPI regex module. It might look like this:

import regex
pattern = r'(?:[0-9]{1,3})(?:packs|pack|p|)\*(?P<weight>[\d.,]{1,3})(?:g|kg)|(?P<weight>[\d.,]{1,3})(?:g|kg)\*(?:[0-9]{1,3})(?:packs|pack|p|)'
examples = "12g*15,13g*20,20pack*2.5kg,40packs*15g,10p*35g".split(',')

rx = regex.compile(pattern)
for e in examples:
    for x in rx.finditer(e):
        print(x.group("weight"))

Output:

12
13
2.5
15
35
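
If installing a third-party module is not an option, a rough workaround with the built-in re module is to give the two groups different names and let match.lastgroup tell you which one matched. This is a sketch under that assumption; the names weight_a, weight_b, weight_by_name and duplicate_names_rejected are invented for the example.

```python
import re

def duplicate_names_rejected():
    """Check that the built-in re module refuses a repeated group name."""
    try:
        re.compile(r'(?P<weight>\d)g|(?P<weight>\d)kg')
    except re.error:
        return True
    return False

# Distinct names per layout (weight_a / weight_b are illustrative only).
NAMED = re.compile(
    r'[0-9]{1,3}(?:packs|pack|p|)\*(?P<weight_a>[\d.,]{1,3})(?:g|kg)'
    r'|(?P<weight_b>[\d.,]{1,3})(?:g|kg)\*[0-9]{1,3}(?:packs|pack|p|)'
)

def weight_by_name(text):
    """Return the weight from whichever named group matched, else None."""
    m = NAMED.search(text)
    if m is None:
        return None
    # lastgroup is the name of the last group that took part in the match.
    return m.group(m.lastgroup)

for sample in "12g*15,13g*20,20pack*2.5kg,40packs*15g,10p*35g".split(','):
    print(weight_by_name(sample))
```

This keeps the lookup to one expression, at the cost of two names instead of one.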

Option 3: Rewrite the RegEx so that both layouts share a single capturing group

You could make the parts before and after the weight optional so that you just have a single instance of the weight group:

(?:(?:[0-9]{1,3})(?:packs|pack|p|)\*)?([\d.,]{1,3})(?:g|kg)(?:\*(?:[0-9]{1,3})(?:packs|pack|p|))?

The above RegEx captures the number for the weight only in Group 1 for all of your examples. However, because both the part before and the part after the weight are optional, it also matches a bare weight such as 5kg, as well as a string like 40packs*15g*40packs, neither of which matches your initial spec. You should be able to rewrite it to be stricter while still keeping only a single capturing group, but it might end up getting quite long.
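
One possible stricter variant, as a sketch only: before the single capturing group, either consume the SIZE* prefix or use a lookahead to assert that the weight is followed by *SIZE. That keeps one group and rejects a bare weight such as 5kg. The names STRICT and strict_weight are made up here, and the pattern has only been checked against the sample strings from the question. Because it is used with search, it will still find a valid layout inside a longer string such as 40packs*15g*40packs.

```python
import re

STRICT = re.compile(
    r'(?:'
    r'[0-9]{1,3}(?:packs|pack|p|)\*'           # layout SIZE*WEIGHT: consume "SIZE*"
    r'|'
    r'(?=[\d.,]{1,3}(?:g|kg)\*[0-9]{1,3})'     # layout WEIGHT*SIZE: only look ahead
    r')'
    r'([\d.,]{1,3})(?:g|kg)'                   # the single capturing group
)

def strict_weight(text):
    """Return the weight from group 1, or None if neither layout is present."""
    m = STRICT.search(text)
    return m.group(1) if m else None

for sample in "12g*15,13g*20,20pack*2.5kg,40packs*15g,10p*35g,5kg".split(','):
    print(sample, strict_weight(sample))
```

The lookahead does not consume the *SIZE suffix, so the overall match text is shorter for the WEIGHT*SIZE layout; only group 1 should be relied on.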
