Home > database >  regex phrase, positive or negative number/decimal with units
regex phrase, positive or negative number/decimal with units

Time:11-02

I am trying to obtain a number in a string. However, the number must follow a very specific pattern in the sentence. We can pretend that I am trying to find the maximum weight for a parcel in a string. For example The 3rd box marked 4 cannot exceed 7.34kg but if its exact 8.42kg its okay too. In this case, I want to match the number 7.34. My pattern in english is:

<Starts with phrase which is in ('must not exceed', 'cannot exceed', 'limited by')><Can be any characters of any length (0 length to infinite) as long as its not any alphabet characters><a positive or negative int or decimal which may or may not be comma separated (ie. 1,032.43kg)><Can be any characters of any length (0 length to infinite) as long as its not any alphabet characters><ends with the characters which are in ('kg', 'k.g', 'k/g')

What I have is:

(must not exceed|cannot exceed|limited by).*?[0-9]  ?(kg|k\.g|k\/g)

However, the main things I cannot do is be able to match <Can be any characters of any length (0 length to infinite) as long as its not any alphabet characters> and `<a positive or negative int or decimal which may or may not be comma separated (ie. 1,032.43kg)>

Some Examples

The 3 boxes with red on them must not exceed -23.4435kg and don't pick them up.
Parcels that can be sent are limited by 1,402kg and its okay to send
The 2 boxes on the shelf must not exceed: 
102 kg
Do not pick up 18 boxes at a time and make sure they cannot exceed,: 92302 k.g
Do not pick up boxes that weight 56.23 kg
Boxes cannot exceed -23 k/g

I understand that I may need to double regex match. So I regex match the sentence first (ie must not exceed -23.4435kg then regex match the number, which is what I am doing in my code at the moment. My question is essentially how do I regex match the correct part of the strings.

CodePudding user response:

I suggest using

\b((?:must\s |can)not\s exceed|limited\s by)\W*?(-?\d (?:,\d )*(?:\.\d )?)\s*(kg|k\.g|k/g)\b

See the regex demo. Details:

  • \b - a word boundary
  • ((?:must\s |can)not\s exceed|limited\s by) - Group 1: must not exceed, cannot exceed or limited by with any whitespaces in between words
  • \W*? - any zero or more, but as few as possible, non-word chars
  • (-?\d (?:,\d )*(?:\.\d )?) - Group 2: number pattern, an optional -, then one or more digits, then zero or more sequences of , and one or more digits, then an optional occurrence of . and one or more digits
  • \s* - zero or more whitespaces
  • (kg|k\.g|k/g) - Group 3: kg, k.g or k/g
  • \b - a word boundary

See the Python demo:

import re
texts = ['The 3rd box marked 4 cannot exceed 7.34kg but if its exact 8.42kg its okay too',
    'The 3 boxes with red on them must not exceed -23.4435kg and don\'t pick them up.',
    'Parcels that can be sent are limited by 1,402kg and its okay to send',
    'The 2 boxes on the shelf must not exceed: \n102 kg',
    'Do not pick up 18 boxes at a time and make sure they cannot exceed,: 92302 k.g',
    'Do not pick up boxes that weight 56.23 kg',
    'Boxes cannot exceed -23 k/g',
    'must not exceed -23.4435kg']

rx = re.compile(r'\b((?:must\s |can)not\s exceed|limited\s by)\W*?(-?\d (?:,\d )*(?:\.\d )?)\s*(kg|k\.g|k/g)\b')
for text in texts:
    print("----", text,"----")
    m = rx.search(text)
    if m:
        print(f"Phrase: {m.group(1)}")
        print(f"Number: {m.group(2)}")
        print(f"UOM: {m.group(3)}")
    else:
        print("Not matched!")

Output:

---- The 3rd box marked 4 cannot exceed 7.34kg but if its exact 8.42kg its okay too ----
Phrase: cannot exceed
Number: 7.34
UOM: kg
---- The 3 boxes with red on them must not exceed -23.4435kg and don't pick them up. ----
Phrase: must not exceed
Number: -23.4435
UOM: kg
---- Parcels that can be sent are limited by 1,402kg and its okay to send ----
Phrase: limited by
Number: 1,402
UOM: kg
---- The 2 boxes on the shelf must not exceed: 
102 kg ----
Phrase: must not exceed
Number: 102
UOM: kg
---- Do not pick up 18 boxes at a time and make sure they cannot exceed,: 92302 k.g ----
Phrase: cannot exceed
Number: 92302
UOM: k.g
---- Do not pick up boxes that weight 56.23 kg ----
Not matched!
---- Boxes cannot exceed -23 k/g ----
Phrase: cannot exceed
Number: -23
UOM: k/g
---- must not exceed -23.4435kg ----
Phrase: must not exceed
Number: -23.4435
UOM: kg

P.S. In Oracle, you will need to throw away word boundaries and replace all non-capturing groups with capturing ones:

 REGEXP_SUBSTR(
   col,
   '((must\s |can)not\s exceed|limited\s by)\W*?(-?\d (,\d )*(\.\d )?)\s*(kg|k\.g|k/g)',
   1,1, NULL, 3)
  • Related