Home > Blockchain >  regex extract digit word
regex extract digit word

Time:12-19

I want to extract the exact value unit from the 8 rows below. But my code is not working properly. For example, my code extracts in row 4 "1000 2 ounce" instead of only "2 ounce". Same issue with row 5 and 6.

  1. includes chalkboard labels, marker & measuring cup capacity s | 135.2 ounce capacity
  2. 33% thicker - 12 glass spice - 3 cubic inch capacity
  3. nolopau 1.53 gallon capacity lon glass jar,
  4. [1000 2 ounce capacity compostable condiment souffluide bagasse
  5. bia cordon bleu inc 900712 8.5 ounce capacity porcelain
  6. bekith 15 9 liter capacity glass jars,
  7. echtpower bento box, 1.5 milliliter lunch box with handle
  8. aislor plastic straining 0.50 gallon capacity lon pitcher, dishwasher

Outcome for each row: 153.2 ounce 3 cubic inch 1.53 gallon 2 ounce 8.5 ounce 9 liter 1.5 milliliter 0.50 gallon

I tried:

def extract_c(df):

pattern =  (\d .\d \s*ounce|\d .\d \s*cubic\s*inch|\d .\d \s*gallon|\d .\d \s*liter|\d .\d \s*milliliter)'

return df.str.extract(pattern)

Can anyone support me in extracting these 8 rows in 1 line?

CodePudding user response:

You need to escape . to literally match .. The following regex should work. Notice the decimal part also needs to be optional (?:\.\d )?

(\d (?:\.\d )?\s*(?:ounce|cubic\s*inch|gallon|liter|milliliter))

s.str.extract('(\d (?:\.\d )?\s*(?:ounce|cubic\s*inch|gallon|liter|milliliter))')
                0
0     135.2 ounce
1    3 cubic inch
2     1.53 gallon
3         2 ounce
4       8.5 ounce
5         9 liter
6  1.5 milliliter
7     0.50 gallon
  • Related