I have strings like:
Bla bla 0.75 oz. Bottle
Mugs, 8oz. White
Bowls, 4.4" dia x 2.5", 12ml. Natural
Ala bala 3.3" 30ml Bottle
I want to extract the numeric value that occurs immediately before one of my predefined lookahead units (here oz or ml), giving:
0.75 oz
8 oz
12 ml
30 ml
I have the below code:
import re
import pandas as pd

look_ahead = "oz|ml"
s = pd.Series(['Bla bla 0.75 oz. Bottle',
               'Mugs, 8oz. White',
               'Bowls, 4.4" dia x 2.5", 12ml. Natural',
               'Ala bala 3.3" 30ml Bottle'])
size_and_units = s.str.findall(
    rf"((?!,)[0-9]+.*[0-9]* *(?={look_ahead})[a-zA-Z]+)")
print(size_and_units)
Which outputs this:
0 [0.75 oz]
1 [8oz]
2 [4.4" dia x 2.5", 12ml]
3 [3.3" 30ml]
You can see there is a mismatch between the output I want and what my script produces. I think my regex is picking up everything between the first numeric value and my defined lookahead, whereas I only want the last numeric value before the lookahead.
I am out of my depth with regex. Can someone help me fix this? Thank you!
CodePudding user response:
Making as few changes to your regex as possible, so you can see what went wrong: in
[0-9]+.*[0-9]*
replace the unescaped . with \. because an unescaped . matches any character, while \. matches only a literal period.
s = pd.Series(['Bla bla 0.75 oz. Bottle',
               'Mugs, 8oz. White',
               'Bowls, 4.4" dia x 2.5", 12ml. Natural',
               'Ala bala 3.3" 30ml Bottle'])
size_and_units = s.str.findall(
    rf"((?!,)[0-9]+\.*[0-9]* *(?={look_ahead})[a-zA-Z]+)")
gives:
0 [0.75 oz]
1 [8oz]
2 [12ml]
3 [30ml]
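To see the difference in isolation, here is a minimal sketch using the standard re module directly on one of the strings (Series.str.findall applies re.findall to each element). The variable names greedy and escaped are only illustrative.

```python
import re

text = 'Bowls, 4.4" dia x 2.5", 12ml. Natural'

# Unescaped dot: ".*" may run across any characters, so the match starts
# at the first digit and stretches all the way to the unit.
greedy = re.findall(r"[0-9]+.*[0-9]* *(?=oz|ml)[a-zA-Z]+", text)

# Escaped dot: "\.*" only allows literal periods after the leading digits,
# so only a number sitting directly in front of the unit can match.
escaped = re.findall(r"[0-9]+\.*[0-9]* *(?=oz|ml)[a-zA-Z]+", text)

print(greedy)   # ['4.4" dia x 2.5", 12ml']
print(escaped)  # ['12ml']
```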
You don't need a lookahead at all, though, since you also want to match the units. Just use
\d+\.*\d*\s*(?:oz|ml)
This gives the same result:
size_and_units = s.str.findall(
    rf"\d+\.*\d*\s*(?:{look_ahead})")
CodePudding user response:
Some notes about the pattern that you tried:
- You can omit the lookahead
(?!,)
as it is always true: the very next thing the pattern matches is a digit, which can never be a comma.
- In this part
.*[0-9]* *(?=oz|ml)[a-zA-Z]+)
everything in
.*[0-9]* *
is optional, and the .* will first match until the end of the string. Then it will backtrack until it can match either oz or ml, and after that it will match 1 or more chars a-zA-Z, so it could also match
0.75 ozaaaaaaa
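Both notes can be checked with a short sketch. It uses the standard re module rather than pandas, and the names original, without_lookahead, samples, same and trailing are only illustrative.

```python
import re

original = r"((?!,)[0-9]+.*[0-9]* *(?=oz|ml)[a-zA-Z]+)"
without_lookahead = r"([0-9]+.*[0-9]* *(?=oz|ml)[a-zA-Z]+)"

samples = ['Bla bla 0.75 oz. Bottle',
           'Mugs, 8oz. White',
           'Bowls, 4.4" dia x 2.5", 12ml. Natural',
           'Ala bala 3.3" 30ml Bottle']

# (?!,) is immediately followed by [0-9]+, so it never rejects a position:
# both patterns return identical matches on every sample.
same = all(re.findall(original, s) == re.findall(without_lookahead, s)
           for s in samples)
print(same)  # True

# [a-zA-Z]+ keeps consuming letters after the unit.
trailing = re.findall(original, "0.75 ozaaaaaaa")
print(trailing)  # ['0.75 ozaaaaaaa']
```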
If you want the matches, you don't need a capture group or lookarounds. You can match:
\b\d+(?:\.\d+)*\s*(?:oz|ml)\b
- \b
A word boundary to prevent a partial word match
- \d+(?:\.\d+)*
Match 1+ digits with an optional decimal part
- \s*(?:oz|ml)
Match optional whitespace chars and either oz or ml
- \b
A word boundary
import pandas as pd

look_ahead = "oz|ml"
s = pd.Series(['Bla bla 0.75 oz. Bottle',
               'Mugs, 8oz. White',
               'Bowls, 4.4" dia x 2.5", 12ml. Natural',
               'Ala bala 3.3" 30ml Bottle'])
size_and_units = s.str.findall(
    rf"\b\d+(?:\.\d+)*\s*(?:{look_ahead})\b")
print(size_and_units)
Output
0 [0.75 oz]
1 [8oz]
2 [12ml]
3 [30ml]
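The question's desired output has a space between the number and the unit (8 oz rather than 8oz). One possible way to get that, sketched here with plain re, is to capture the number and the unit separately and rejoin them. UNIT_PATTERN and extract_sizes are hypothetical names introduced for this illustration and are not part of the answer above.

```python
import re

# Same pattern as above, but with the number and the unit in capture groups
# so that findall returns (number, unit) tuples.
UNIT_PATTERN = re.compile(r"\b(\d+(?:\.\d+)*)\s*(oz|ml)\b")

def extract_sizes(text):
    """Return every size found in `text` as 'number unit' with one space."""
    return [f"{number} {unit}" for number, unit in UNIT_PATTERN.findall(text)]

strings = ['Bla bla 0.75 oz. Bottle',
           'Mugs, 8oz. White',
           'Bowls, 4.4" dia x 2.5", 12ml. Natural',
           'Ala bala 3.3" 30ml Bottle']

results = [extract_sizes(s) for s in strings]
print(results)  # [['0.75 oz'], ['8 oz'], ['12 ml'], ['30 ml']]
```

With pandas, the same function could presumably be applied to the Series with s.apply(extract_sizes).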
CodePudding user response:
I think this regular expression will work for you:
[0-9]+\.*[0-9]* *(oz|ml)
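One caveat if this pattern is passed to findall: when a pattern contains a single capturing group, re.findall (which Series.str.findall applies to each element) returns only the group's contents, not the whole match. The sketch below compares the capturing and non-capturing forms; the variable names are illustrative.

```python
import re

text = "Mugs, 8oz. White"

# Capturing group: findall returns only what the group matched.
with_group = re.findall(r"[0-9]+\.*[0-9]* *(oz|ml)", text)

# Non-capturing group: findall returns the whole match.
without_group = re.findall(r"[0-9]+\.*[0-9]* *(?:oz|ml)", text)

print(with_group)     # ['oz']
print(without_group)  # ['8oz']
```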