I am looking to parse patterns like "w x h x l" using regex, so basically the letters w,h,l (and others) with "x" in between. There could be text around the searched expression, and "w x h x l x l x h" would be valid as well.
I have tried the regular expression
(w|h|l|b)(\s*x\s*(w|h|l|b))
but I don't understand why this doesn't work.
Examples (expected results with Python's re.findall):
"The measurements are (w x h x l): 5x7x3cm" => [(w,h,l)]
"Measurement options are (wxhxl), (hxlxb): Some random stuff" => [(w,h,l),(h,l,b)]
"The measurements, in form wxhxl: 5x7x3cm" => [(w,h,l)]
CodePudding user response:
You can use your pattern with non-capturing groups to extract all matches, then strip the whitespace from each match and split it on x to get the separate chars:
import re
texts = [
"The measurements are (w x h x l): 5x7x3cm", # => [(w,h,l)]
"Measurement options are (wxhxl), (hxlxb): Some random stuff", # => [(w,h,l),(h,l,b)]
"The measurements, in form wxhxl: 5x7x3cm" # => [(w,h,l)]
]
for text in texts:
    print([tuple(''.join(x.split()).split('x')) for x in re.findall(r'\b[whlb](?:\s*x\s*[whlb])+\b', text)])
See the Python demo. Output:
[('w', 'h', 'l')]
[('w', 'h', 'l'), ('h', 'l', 'b')]
[('w', 'h', 'l')]
The \b[whlb](?:\s*x\s*[whlb])+\b pattern matches
\b - word boundary
[whlb] - a w, h, l or b char
(?:\s*x\s*[whlb])+ - one or more repetitions of an x enclosed with zero or more whitespaces and then a w, h, l or b char
\b - word boundary
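As for why the pattern in the question appears not to work: a minimal sketch, assuming the original pattern ended in a + quantifier (that quantifier is an assumption, it is not visible in the question as posted). When a pattern contains capturing groups, re.findall returns the group values instead of the whole match, and a repeated group only keeps its last repetition.

```python
import re

text = "The measurements are (w x h x l): 5x7x3cm"

# Assumed reconstruction of the question's pattern (the trailing + is an assumption).
capturing = r'(w|h|l|b)(\s*x\s*(w|h|l|b))+'
# Same shape with a character class and a non-capturing group.
non_capturing = r'\b[whlb](?:\s*x\s*[whlb])+\b'

# With capturing groups, findall returns one tuple of group values per match,
# and each repeated group holds only its last repetition.
with_groups = re.findall(capturing, text)

# Without capturing groups, findall returns the full matched text.
whole_matches = re.findall(non_capturing, text)

print(with_groups)    # [('w', ' x l', 'l')]
print(whole_matches)  # ['w x h x l']
```

The middle letter h is lost in the first result because groups 2 and 3 are overwritten on every repetition, which is why the whole match has to be split afterwards.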
CodePudding user response:
If you can use the PyPI regex module, you can reuse a named capture group and read all of its captures:
\b(?<pat>[whlb])(?:\s*x\s*(?<pat>[whlb]))+\b
\b A word boundary to prevent a partial word match
(?<pat>[whlb]) Group pat, match one of w, h, l or b
(?: Non-capture group to repeat as a whole
\s*x\s*(?<pat>[whlb]) Match an x between optional whitespace chars and again the named capture group pat
)+ Close the non-capture group and repeat it 1 or more times to match at least a single x
\b A word boundary
See a regex demo for the capture group values and a Python demo.
import regex
pattern = r'(?<pat>[whlb])(?:\s*x\s*(?<pat>[whlb]))+'
s = ("The measurements are (w x h x l): 5x7x3cm\n"
"The measurements, in form wxhxl: 5x7x3cm\n"
"Measurement options are (wxhxl), (hxlxb): Some random stuff\n"
"w x h x l x l x h")
for m in regex.finditer(pattern, s):
    print(tuple(m.captures("pat")))
Output
('w', 'h', 'l')
('w', 'h', 'l')
('w', 'h', 'l')
('h', 'l', 'b')
('w', 'h', 'l', 'l', 'h')
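If installing the third-party regex module is not an option, a similar result can be approximated with the standard re module by matching each run first and then pulling the letters out of the matched text. This is only a sketch: dimension_letters, RUN and LETTER are hypothetical names, and the sample string is made up for illustration.

```python
import re

# Outer pattern finds a whole run such as "w x h x l";
# inner pattern extracts the individual letters from that run.
RUN = re.compile(r'\b[whlb](?:\s*x\s*[whlb])+\b')
LETTER = re.compile(r'[whlb]')

def dimension_letters(text):
    """Hypothetical helper: return one tuple of letters per matched run."""
    return [tuple(LETTER.findall(m.group())) for m in RUN.finditer(text)]

result = dimension_letters("Options are (wxhxl), (hxlxb) or w x h x l x l x h")
print(result)
# [('w', 'h', 'l'), ('h', 'l', 'b'), ('w', 'h', 'l', 'l', 'h')]
```

The inner search is safe here because a matched run contains only the allowed letters, x and whitespace, so every [whlb] hit is one of the wanted chars.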