I'm trying to use regex to match numbers that are followed by units, including spaces since sometimes the text isn't clean. For example, if I have some text like this:
blah blah 5/8" blah blah 60lbs blah blah 1 /8" blah blah 40 lbs 6oz
I would want to match:
5/8"
60lbs
1 /8"
40 lbs
6oz
I was thinking of having a variable where I can set the unit (running a loop through a list of units) and adding that to the regex so that it basically matches "some numbers" followed by "unit", but I'm having trouble coming up with the regex for matching everything before the unit.
Would appreciate any help! Thank you!
Note: I can also alter the text if that's easier. I thought maybe removing all spaces could be helpful but that might also complicate things more.
CodePudding user response:
I have a Python solution that works:
import re

text = """blah blah 5/8" blah blah 60lbs
blah blah 580/18" blah blah 60lbs
blah blah 1 /8" blah blah 40 lbs 6oz, 5Kg"""
units = ['"', 'lbs', 'oz', 'kg']  # add lowercase units of measure
digit_regex = [r'(\d*?\/\d*?', r'(\d{2,}?']  # [digits with "/" between, 2 or more digits]
results = []
# Strip spaces and lowercase once, so "40 lbs" becomes "40lbs" and "5Kg" becomes "5kg"
cleaned = text.replace(' ', '').lower()
for measure_unit in units:
    for digit in digit_regex:
        # Close the capture group that each digit pattern opens
        pattern = f'{digit}{measure_unit})'
        for match in re.findall(pattern, cleaned):
            if match is not None and match != '':
                results.append(match)
print(results)
The output will be:
['5/8"', '580/18"', '1/8"', '18"', '60lbs', '60lbs', '40lbs']
[Finished in 21ms]
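It needs some further thinking, because there are situations where a match like '18"' shouldn't be present since it's part of '580/18"', and single-digit values such as '6oz' and '5kg' are missed because the second pattern requires at least two digits. But this should get you going.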
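One way to avoid the duplicate '18"' and the missed single-digit values is to build a single pattern with an optional fraction part and scan the original text once, instead of looping over unit and digit patterns. The following is only a sketch under assumptions: the names unit_alt and matches are made up for illustration, and the pattern assumes a value is always "digits", optionally followed by "/digits", then optional whitespace and a unit.

```python
import re

text = '''blah blah 5/8" blah blah 60lbs
blah blah 580/18" blah blah 60lbs
blah blah 1 /8" blah blah 40 lbs 6oz, 5Kg'''

units = ['"', 'lbs', 'oz', 'kg']

# Escape each unit and try longer units first, so that a short unit
# can never shadow a longer one that starts with the same characters.
unit_alt = '|'.join(re.escape(u) for u in sorted(units, key=len, reverse=True))

# digits, an optional "/digits" part (stray spaces allowed around the slash),
# optional whitespace, then one of the units; case-insensitive so "Kg" matches "kg"
pattern = re.compile(rf'\d+(?:\s*/\s*\d+)?\s*(?:{unit_alt})', re.IGNORECASE)

matches = pattern.findall(text)
print(matches)
# ['5/8"', '60lbs', '580/18"', '60lbs', '1 /8"', '40 lbs', '6oz', '5Kg']
```

Because the fraction is consumed as part of one match, '580/18"' is reported once and the scan resumes after it, so '18"' never shows up on its own. The text is left untouched, so the matches keep their original spacing and case.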
CodePudding user response:
In the unit group, add all units:
(?:(?P<num>(?:(?:([\+-]\s*)?[1-9]\d*)|0)(?:\s*\/\d*)?)(?:\s*(?P<unit>\"|lbs|oz))?)
In Python:
import re

test = 'blah blah 5/8" blah blah 60lbs blah blah 1 /8" blah blah 40 lbs 6oz'
units = ['"', 'lbs', 'oz']  # define list of accepted units
# Join the units into the alternation inside the named "unit" group
pattern = fr'(?:(?P<num>(?:(?:([\+-]\s*)?[1-9]\d*)|0)(?:\s*\/\d*)?)(?:\s*(?P<unit>{"|".join(units)}))?)'
r = re.compile(pattern)
res = [m.groupdict() for m in r.finditer(test)]
Generated pattern:
'(?:(?P<num>(?:(?:([\\+-]\\s*)?[1-9]\\d*)|0)(?:\\s*\\/\\d*)?)(?:\\s*(?P<unit>"|lbs|oz))?)'
res:
[{'num': '5/8', 'unit': '"'},
{'num': '60', 'unit': 'lbs'},
{'num': '1 /8', 'unit': '"'},
{'num': '40', 'unit': 'lbs'},
{'num': '6', 'unit': 'oz'}]
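One thing to watch for: the unit group in this pattern is optional, so a bare number with no unit also matches and comes back with 'unit' set to None. Below is a rough sketch of how the matches could be filtered. It uses a lightly simplified variant of the pattern, not the exact one above, and assumes the optional sign is + or -. The sample string and the names unit_alt, with_unit and bare are made up for illustration.

```python
import re

# Hypothetical sample text that also contains a bare number ("12")
test = 'blah blah 5/8" blah blah 40 lbs 6oz blah 12 blah'
units = ['"', 'lbs', 'oz']

# Escape the units in case one of them contains a regex metacharacter
unit_alt = '|'.join(re.escape(u) for u in units)

pattern = re.compile(
    r'(?P<num>(?:[+-]\s*)?(?:[1-9]\d*|0)(?:\s*/\d*)?)'  # number, optional "/digits"
    rf'(?:\s*(?P<unit>{unit_alt}))?'                     # optional unit
)

all_matches = list(pattern.finditer(test))

# Keep the full matched text only when a unit was actually found
with_unit = [m.group(0) for m in all_matches if m.group('unit') is not None]
# Numbers that matched without any unit
bare = [m.group('num') for m in all_matches if m.group('unit') is None]

print(with_unit)  # ['5/8"', '40 lbs', '6oz']
print(bare)       # ['12']
```

If only numbers with units are wanted, dropping the trailing ? after the unit group should have the same effect as the filter.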