regex | extract numbers preceded by defined strings

Time:10-28

I have strings like:

Bla bla 0.75 oz. Bottle
Mugs, 8oz. White
Bowls, 4.4" dia x 2.5", 12ml. Natural
Ala bala 3.3" 30ml Bottle

I want to extract the numeric value that occurs before one of my pre-defined lookaheads, in this case [oz, ml]:

0.75 oz
8 oz
12 ml
30 ml

I have the below code:

import re
import pandas as pd
look_ahead = "oz|ml"

s = pd.Series(['Bla bla 0.75 oz. Bottle',
              'Mugs, 8oz. White',
              'Bowls, 4.4" dia x 2.5", 12ml. Natural',
              'Ala bala 3.3" 30ml Bottle'])
size_and_units = s.str.findall(
        rf"((?!,)[0-9]+.*[0-9]* *(?={look_ahead})[a-zA-Z]+)")

print(size_and_units)

Which outputs this:

0                  [0.75 oz]
1                      [8oz]
2    [4.4" dia x 2.5", 12ml]
3                [3.3" 30ml]

You can see there is a mismatch between the output I want and what my script produces. I think my regex is picking up everything between the first numeric value and my defined lookahead, whereas I only want the last numeric value before the lookahead.

I am out of my depth with regex. Can someone help fix this? Thank you!

CodePudding user response:

Making as few changes to your regex as possible, so you can see what went wrong: in [0-9]+.*[0-9]*, replace . with \.. An unescaped . means any character; \. means a literal period.

s = pd.Series(['Bla bla 0.75 oz. Bottle',
              'Mugs, 8oz. White',
              'Bowls, 4.4" dia x 2.5", 12ml. Natural',
              'Ala bala 3.3" 30ml Bottle'])
size_and_units = s.str.findall(
        rf"((?!,)[0-9]+\.*[0-9]* *(?={look_ahead})[a-zA-Z]+)")

gives:

0    [0.75 oz]
1        [8oz]
2       [12ml]
3       [30ml]

You don't need to use a lookahead at all though, since you also want to match the units. Just do

\d+\.*\d*\s*(?:oz|ml)

This gives the same result:

size_and_units = s.str.findall(
        rf"\d+\.*\d*\s*(?:{look_ahead})")
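The outputs above keep whatever spacing the source string had (8oz, 12ml), while the question asked for 8 oz and 12 ml. A minimal sketch of one way to normalize this, assuming it is acceptable to capture the number and the unit separately and rejoin them with a single space. The helper name size_with_unit is hypothetical, the decimal part is written as (?:\.\d+)? rather than \.*\d*, and plain re is used so the snippet runs without pandas:

```python
import re

units = "oz|ml"  # same alternation as look_ahead in the question

# Two capture groups: the number (with an optional decimal part) and the unit.
pattern = re.compile(rf"(\d+(?:\.\d+)?)\s*({units})")

def size_with_unit(text):
    """Hypothetical helper: return 'number unit' strings for every match in text."""
    # With two groups, findall yields (number, unit) tuples.
    return [f"{number} {unit}" for number, unit in pattern.findall(text)]

rows = ['Bla bla 0.75 oz. Bottle',
        'Mugs, 8oz. White',
        'Bowls, 4.4" dia x 2.5", 12ml. Natural',
        'Ala bala 3.3" 30ml Bottle']

results = [size_with_unit(row) for row in rows]
print(results)
```

On a pandas Series the same function could presumably be applied with s.apply(size_with_unit).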

CodePudding user response:

Some notes about the pattern that you tried:

  • You can omit the lookahead (?!,): it is always true, because the match starts with a digit, which can never be a comma
  • In the part .*[0-9]* *(?=oz|ml)[a-zA-Z]+) everything in .*[0-9]* * is optional, and the greedy .* first runs to the end of the string. The engine then backtracks until it can match either oz or ml, after which [a-zA-Z]+ matches 1 or more letters, so it could also match 0.75 ozaaaaaaa
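Both notes can be checked directly with the re module. This is a small sketch that uses the pattern from the question with the unit alternation written out; the variable names are only for illustration:

```python
import re

# The pattern from the question, with oz|ml substituted for look_ahead.
original = r"((?!,)[0-9]+.*[0-9]* *(?=oz|ml)[a-zA-Z]+)"

# Greedy .* runs to the end of the string, then backtracks to the last unit,
# so everything from the first digit up to "ml" ends up in the match.
greedy = re.findall(original, 'Bowls, 4.4" dia x 2.5", 12ml. Natural')
print(greedy)    # ['4.4" dia x 2.5", 12ml']

# [a-zA-Z]+ keeps consuming letters after the unit.
trailing = re.findall(original, "0.75 ozaaaaaaa")
print(trailing)  # ['0.75 ozaaaaaaa']
```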

If you want the matches, you don't need a capture group or lookarounds. You can match:

\b\d+(?:\.\d+)*\s*(?:oz|ml)\b
  • \b A word boundary to prevent a partial word match
  • \d+(?:\.\d+)* Match 1 or more digits with an optional decimal part
  • \s*(?:oz|ml) Match optional whitespace chars and either oz or ml
  • \b A word boundary


import pandas as pd

look_ahead = "oz|ml"

s = pd.Series(['Bla bla 0.75 oz. Bottle',
               'Mugs, 8oz. White',
               'Bowls, 4.4" dia x 2.5", 12ml. Natural',
               'Ala bala 3.3" 30ml Bottle'])
size_and_units = s.str.findall(
    rf"\b\d+(?:\.\d+)*\s*(?:{look_ahead})\b")

print(size_and_units)

Output

0    [0.75 oz]
1        [8oz]
2       [12ml]
3       [30ml]
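To see what the word boundaries add, here is a short sketch comparing the pattern with and without \b. The units list and the sample text are made up for illustration, and re.escape is used on the assumption that a unit might someday contain a regex metacharacter:

```python
import re

units = ["oz", "ml"]  # hypothetical list form of look_ahead
alternation = "|".join(re.escape(u) for u in units)

bounded = re.compile(rf"\b\d+(?:\.\d+)*\s*(?:{alternation})\b")
unbounded = re.compile(rf"\d+(?:\.\d+)*\s*(?:{alternation})")

text = "Adds 3 ozone filters, 12ml. Natural"  # made-up sample string

with_boundary = bounded.findall(text)
without_boundary = unbounded.findall(text)

print(with_boundary)     # ['12ml']
print(without_boundary)  # ['3 oz', '12ml']
```

Without the trailing \b, the oz at the start of "ozone" is treated as a unit.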

CodePudding user response:

I think this regex will work for you:

[0-9]+\.*[0-9]* *(oz|ml)
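One caveat when using this pattern with findall: if a pattern contains a single capture group, re.findall returns only the group, not the whole match, and pandas' Series.str.findall is documented as applying re.findall to each element. A quick check with plain re, with the non-capturing variant shown for comparison:

```python
import re

text = "Mugs, 8oz. White"

# One capture group: findall returns just the captured unit.
captured = re.findall(r"[0-9]+\.*[0-9]* *(oz|ml)", text)

# Non-capturing group: findall returns the whole match.
non_capturing = re.findall(r"[0-9]+\.*[0-9]* *(?:oz|ml)", text)

print(captured)       # ['oz']
print(non_capturing)  # ['8oz']
```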