I am parsing a string with bundle configurations. For simplicity's sake, let's say there are only two layouts:
WEIGHT*SIZE
and
SIZE*WEIGHT
Sample data looks like this:
12g*15
13g*20
20pack*2.5kg
40packs*15g
10p*35g
The regex I am using now is basically two expressions, one per layout, separated by '|':
(?:[0-9]{1,3})(?:packs|pack|p|)\*([\d.,]{1,3})(?:g|kg)|([\d.,]{1,3})(?:g|kg)\*(?:[0-9]{1,3})(?:packs|pack|p|)
But for the first two lines the result ends up in group(2), and for the last three it is in group(1).
For simplicity's sake, can I somehow get the result in group(1) only, regardless of which side of the "|" in my regex matched, so that I don't need to iterate through the groups after using re.search?
(I know I could just use ([\d.]{1,3})(?:g|kg), but I need to fetch the weight from exactly these types of layouts; a single weight without a bundle size, like 5kg, should not be taken into account.)
CodePudding user response:
You didn't specify which language/flavour of RegEx you are using, but assuming you are using Python, here are a few possible solutions:
Option 1: Select first non-empty capturing group
This is the solution proposed in the comment from Tim Biegeleisen, and probably the quickest. It might look something like this:
import re

# Raw string so that \* and \d reach the regex engine unchanged
pattern = r'(?:[0-9]{1,3})(?:packs|pack|p|)\*([\d.,]{1,3})(?:g|kg)|([\d.,]{1,3})(?:g|kg)\*(?:[0-9]{1,3})(?:packs|pack|p|)'
examples = "12g*15,13g*20,20pack*2.5kg,40packs*15g,10p*35g".split(',')
rx = re.compile(pattern)
for e in examples:
    for match in rx.finditer(e):
        # Exactly one of the two groups is set; the other is None
        for g in match.groups():
            if g is not None:
                print(g)
Output:
12
13
2.5
15
35
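If the goal is to avoid the loop over the groups, one possible shortcut is Match.lastindex, which in Python's re module holds the index of the last capturing group that took part in the match. Because exactly one of the two groups participates here, that index points at the weight. A minimal sketch, where the helper name extract_weight is made up for illustration:

```python
import re

# Same pattern as in the question: group 1 fires for SIZE*WEIGHT,
# group 2 fires for WEIGHT*SIZE.
PATTERN = re.compile(
    r'(?:[0-9]{1,3})(?:packs|pack|p|)\*([\d.,]{1,3})(?:g|kg)'
    r'|([\d.,]{1,3})(?:g|kg)\*(?:[0-9]{1,3})(?:packs|pack|p|)'
)


def extract_weight(text):
    """Hypothetical helper: return the weight as a string, or None."""
    m = PATTERN.search(text)
    if m is None:
        return None
    # Only one of the two groups participates, so lastindex identifies it.
    return m.group(m.lastindex)


for sample in ["12g*15", "13g*20", "20pack*2.5kg", "40packs*15g", "10p*35g", "5kg"]:
    print(sample, "->", extract_weight(sample))
```

This only works as long as each alternative contains exactly one capturing group; with more groups per branch, lastindex would no longer be guaranteed to point at the weight.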
Option 2: Use named capturing groups
The syntax is (?P<name>regex), as per this page. Some regex flavours allow the same name for two capturing groups, so you could modify your RegEx to the following:
(?:[0-9]{1,3})(?:packs|pack|p|)\*(?P<weight>[\d.,]{1,3})(?:g|kg)|(?P<weight>[\d.,]{1,3})(?:g|kg)\*(?:[0-9]{1,3})(?:packs|pack|p|)
However, Python's built-in re module does not support identically named groups (as per this answer), so you would need to pip install and import the PyPI regex module. It might look like this:
import regex  # third-party module: pip install regex

pattern = r'(?:[0-9]{1,3})(?:packs|pack|p|)\*(?P<weight>[\d.,]{1,3})(?:g|kg)|(?P<weight>[\d.,]{1,3})(?:g|kg)\*(?:[0-9]{1,3})(?:packs|pack|p|)'
examples = "12g*15,13g*20,20pack*2.5kg,40packs*15g,10p*35g".split(',')
rx = regex.compile(pattern)
for e in examples:
    for x in rx.finditer(e):
        # Both alternatives share the name, so one lookup covers either layout
        print(x.group("weight"))
Output:
12
13
2.5
15
35
Option 3: Rewrite the RegEx so that you can put both options inside a single capturing group
You could make the parts before and after the weight optional so that you just have a single instance of the weight group:
(?:(?:[0-9]{1,3})(?:packs|pack|p|)\*)?([\d.,]{1,3})(?:g|kg)(?:\*(?:[0-9]{1,3})(?:packs|pack|p|))?
The above RegEx captures the number for the weight in Group 1 only, for all of your examples. However, because both the prefix and the suffix are optional, it will also capture the weight from strings that don't match your initial spec, such as 40packs*15g*40packs or a bare 5kg. You should be able to rewrite it to be stricter while still keeping only a single capturing group, but it might end up getting quite long.
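One way to tighten this with the standard re module is a conditional pattern, (?(1)yes|no). The sketch below is not quite a single capturing group: it adds a helper group 1 that only records whether a SIZE* prefix was seen, and reads the weight from a named group so the result is always found in the same place. It assumes each input string holds exactly one bundle description, which is why it uses fullmatch; the names STRICT and strict_weight are made up for illustration.

```python
import re

STRICT = re.compile(
    # Optional SIZE* prefix; helper group 1 is set only if it is present.
    r'([0-9]{1,3}(?:packs|pack|p)?\*)?'
    # The weight number, followed by its unit.
    r'(?P<weight>[\d.,]{1,3})(?:kg|g)'
    # If group 1 matched, require nothing more; otherwise demand *SIZE.
    r'(?(1)|\*[0-9]{1,3}(?:packs|pack|p)?)'
)


def strict_weight(text):
    """Hypothetical helper: weight for a valid bundle string, else None."""
    m = STRICT.fullmatch(text)
    return m.group("weight") if m else None


samples = ["12g*15", "13g*20", "20pack*2.5kg", "40packs*15g", "10p*35g",
           "5kg", "40packs*15g*40packs"]
for sample in samples:
    print(sample, "->", strict_weight(sample))
```

With fullmatch, the bare 5kg and the double-sided 40packs*15g*40packs are both rejected. If the bundle text is embedded in a longer string, search would be needed instead, and the pattern would then again find 40packs*15g inside 40packs*15g*40packs.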