python regular expression optional but mandatory if character precedes

Time:06-11

I am trying to capture something along the lines of

1/2x1 + 3x2 - 4/5x3

I will strip the spaces beforehand, so the regular expression does not need to capture them. The problem is that I want the preceding coefficient to optionally be a fraction: if I see a /, then it must be followed by \d. I don't necessarily care to capture the /.

Ideally I would extract the groups as such:

# first match
match.groups(1)
('1', '2', 'x1')

#second match
('+', '3', 'x2')

#third match
('-', '4', '5', 'x3')

Something that is (sort of) working is ([+-])?(\d+)(\/\d+)?([a-zA-Z]+\d+). However, I don't love that it also captures the preceding '/'.

Example output:

>>> import re
>>> regexp = re.compile(r'([+-])?(\d+)(\/\d+)?([a-zA-Z]+\d+)')
>>> expr = '1/2a3+1/8x2-4x3'
>>> match = regexp.search(expr)
>>> match.groups(1)
(1, '1', '/2', 'a3')

>>> expr = expr.replace(match.group(0), '')
>>> match = regexp.search(expr)
>>> match.groups(1)
('+', '1', '/8', 'x2')

>>> expr = expr.replace(match.group(0), '')
>>> match = regexp.search(expr)
>>> match.groups(1)
('-', '4', 1, 'x3')

In the first match, what does the first element 1 mean? I see the same thing in the third match, third element. In both of these - that particular "group" is missing. So is that just a way of being like "I matched, but I didn't match anything"?

Another issue with the above regex is that it makes the [+-] optional. I want it to be optional on the first term but mandatory on subsequent terms.

Anyway, the above is usable: I'll need to peel off the /, and I can sanitize the input to ensure the signs are always there, but it's not as elegant as I'm sure it can be.

Thanks for any help

CodePudding user response:

You could rework your regex slightly to use capturing groups only for things you want to capture and then use re.findall to extract all matches at once:

import re
expr = '1/2a3+1/8x2-4x3'
regexp = re.compile(r'([+-])?(\d+)(?:/(\d+))?([a-zA-Z]+\d+)')
res = regexp.findall(expr)

Output:

[
 ('', '1', '2', 'a3'),
 ('+', '1', '8', 'x2'),
 ('-', '4', '', 'x3')
]

Note that when there is no fraction (or no sign on the first value) there may be empty values ('') in the tuple. You could (if required) filter those out, e.g.

[tuple(filter(lambda x:x, tup)) for tup in res]
# [('1', '2', 'a3'), ('+', '1', '8', 'x2'), ('-', '4', 'x3')]

However, you would then face the difficulty of knowing which value in each tuple corresponds to which part of the expression.
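
One way around that ambiguity, sketched here on the assumption that named groups are acceptable, is to label each part and iterate with re.finditer. The group names (sign, num, den, var) and the helper parse_terms are made up for illustration and are not from the question. The sketch also checks the whole string first, so that a sign is optional only on the first term, and it shows that the 1 in the question's output is simply the default value passed to match.groups(1) for groups that did not take part in the match.

```python
import re

# One term, with named groups so each part stays identifiable
# even when an optional group is absent (names are illustrative).
TERM = re.compile(
    r'(?P<sign>[+-])?'        # optional sign
    r'(?P<num>\d+)'           # numerator, or the whole coefficient
    r'(?:/(?P<den>\d+))?'     # optional denominator; the '/' is not captured
    r'(?P<var>[a-zA-Z]+\d+)'  # variable such as x1
)

# Whole-expression check: sign optional on the first term only,
# mandatory before every later term.
BODY = r'\d+(?:/\d+)?[a-zA-Z]+\d+'
VALID = re.compile(rf'[+-]?{BODY}(?:[+-]{BODY})*')


def parse_terms(expr):
    """Return one dict per term; raise ValueError on malformed input."""
    expr = expr.replace(' ', '')
    if not VALID.fullmatch(expr):
        raise ValueError(f'malformed expression: {expr!r}')
    return [m.groupdict() for m in TERM.finditer(expr)]


for term in parse_terms('1/2x1 + 3x2 - 4/5x3'):
    print(term)

# groups() accepts a default for groups that did not participate.
first = TERM.search('1/2a3+1/8x2-4x3')
print(first.groups())   # (None, '1', '2', 'a3')
print(first.groups(1))  # (1, '1', '2', 'a3')
```

With this approach a missing sign or denominator shows up as None under its own key, so nothing needs to be filtered out, and an input such as '1/x1' is rejected because a / must be followed by digits.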
