I want to extract the exact value unit from the 8 rows below. But my code is not working properly. For example, my code extracts in row 4 "1000 2 ounce" instead of only "2 ounce". Same issue with row 5 and 6.
- includes chalkboard labels, marker & measuring cup capacity s | 135.2 ounce capacity
- 33% thicker - 12 glass spice - 3 cubic inch capacity
- nolopau 1.53 gallon capacity lon glass jar,
- [1000 2 ounce capacity compostable condiment souffluide bagasse
- bia cordon bleu inc 900712 8.5 ounce capacity porcelain
- bekith 15 9 liter capacity glass jars,
- echtpower bento box, 1.5 milliliter lunch box with handle
- aislor plastic straining 0.50 gallon capacity lon pitcher, dishwasher
Outcome for each row: 153.2 ounce 3 cubic inch 1.53 gallon 2 ounce 8.5 ounce 9 liter 1.5 milliliter 0.50 gallon
I tried:
def extract_c(df):
pattern = (\d .\d \s*ounce|\d .\d \s*cubic\s*inch|\d .\d \s*gallon|\d .\d \s*liter|\d .\d \s*milliliter)'
return df.str.extract(pattern)
Can anyone support me in extracting these 8 rows in 1 line?
CodePudding user response:
You need to escape .
to literally match .
. The following regex should work. Notice the decimal part also needs to be optional (?:\.\d )?
(\d (?:\.\d )?\s*(?:ounce|cubic\s*inch|gallon|liter|milliliter))
s.str.extract('(\d (?:\.\d )?\s*(?:ounce|cubic\s*inch|gallon|liter|milliliter))')
0
0 135.2 ounce
1 3 cubic inch
2 1.53 gallon
3 2 ounce
4 8.5 ounce
5 9 liter
6 1.5 milliliter
7 0.50 gallon