I am trying to obtain a number in a string. However, the number must follow a very specific pattern in the sentence. We can pretend that I am trying to find the maximum weight for a parcel in a string. For example The 3rd box marked 4 cannot exceed 7.34kg but if its exact 8.42kg its okay too
. In this case, I want to match the number 7.34
. My pattern in english is:
<Starts with phrase which is in ('must not exceed', 'cannot exceed', 'limited by')><Can be any characters of any length (0 length to infinite) as long as its not any alphabet characters><a positive or negative int or decimal which may or may not be comma separated (ie. 1,032.43kg)><Can be any characters of any length (0 length to infinite) as long as its not any alphabet characters><ends with the characters which are in ('kg', 'k.g', 'k/g')
What I have is:
(must not exceed|cannot exceed|limited by).*?[0-9] ?(kg|k\.g|k\/g)
However, the main things I cannot do is be able to match <Can be any characters of any length (0 length to infinite) as long as its not any alphabet characters>
and `<a positive or negative int or decimal which may or may not be comma separated (ie. 1,032.43kg)>
Some Examples
The 3 boxes with red on them must not exceed -23.4435kg and don't pick them up.
Parcels that can be sent are limited by 1,402kg and its okay to send
The 2 boxes on the shelf must not exceed:
102 kg
Do not pick up 18 boxes at a time and make sure they cannot exceed,: 92302 k.g
Do not pick up boxes that weight 56.23 kg
Boxes cannot exceed -23 k/g
I understand that I may need to double regex match. So I regex match the sentence first (ie must not exceed -23.4435kg
then regex match the number, which is what I am doing in my code at the moment. My question is essentially how do I regex match the correct part of the strings.
CodePudding user response:
I suggest using
\b((?:must\s |can)not\s exceed|limited\s by)\W*?(-?\d (?:,\d )*(?:\.\d )?)\s*(kg|k\.g|k/g)\b
See the regex demo. Details:
\b
- a word boundary((?:must\s |can)not\s exceed|limited\s by)
- Group 1:must not exceed
,cannot exceed
orlimited by
with any whitespaces in between words\W*?
- any zero or more, but as few as possible, non-word chars(-?\d (?:,\d )*(?:\.\d )?)
- Group 2: number pattern, an optional-
, then one or more digits, then zero or more sequences of,
and one or more digits, then an optional occurrence of.
and one or more digits\s*
- zero or more whitespaces(kg|k\.g|k/g)
- Group 3:kg
,k.g
ork/g
\b
- a word boundary
See the Python demo:
import re
texts = ['The 3rd box marked 4 cannot exceed 7.34kg but if its exact 8.42kg its okay too',
'The 3 boxes with red on them must not exceed -23.4435kg and don\'t pick them up.',
'Parcels that can be sent are limited by 1,402kg and its okay to send',
'The 2 boxes on the shelf must not exceed: \n102 kg',
'Do not pick up 18 boxes at a time and make sure they cannot exceed,: 92302 k.g',
'Do not pick up boxes that weight 56.23 kg',
'Boxes cannot exceed -23 k/g',
'must not exceed -23.4435kg']
rx = re.compile(r'\b((?:must\s |can)not\s exceed|limited\s by)\W*?(-?\d (?:,\d )*(?:\.\d )?)\s*(kg|k\.g|k/g)\b')
for text in texts:
print("----", text,"----")
m = rx.search(text)
if m:
print(f"Phrase: {m.group(1)}")
print(f"Number: {m.group(2)}")
print(f"UOM: {m.group(3)}")
else:
print("Not matched!")
Output:
---- The 3rd box marked 4 cannot exceed 7.34kg but if its exact 8.42kg its okay too ----
Phrase: cannot exceed
Number: 7.34
UOM: kg
---- The 3 boxes with red on them must not exceed -23.4435kg and don't pick them up. ----
Phrase: must not exceed
Number: -23.4435
UOM: kg
---- Parcels that can be sent are limited by 1,402kg and its okay to send ----
Phrase: limited by
Number: 1,402
UOM: kg
---- The 2 boxes on the shelf must not exceed:
102 kg ----
Phrase: must not exceed
Number: 102
UOM: kg
---- Do not pick up 18 boxes at a time and make sure they cannot exceed,: 92302 k.g ----
Phrase: cannot exceed
Number: 92302
UOM: k.g
---- Do not pick up boxes that weight 56.23 kg ----
Not matched!
---- Boxes cannot exceed -23 k/g ----
Phrase: cannot exceed
Number: -23
UOM: k/g
---- must not exceed -23.4435kg ----
Phrase: must not exceed
Number: -23.4435
UOM: kg
P.S. In Oracle, you will need to throw away word boundaries and replace all non-capturing groups with capturing ones:
REGEXP_SUBSTR(
col,
'((must\s |can)not\s exceed|limited\s by)\W*?(-?\d (,\d )*(\.\d )?)\s*(kg|k\.g|k/g)',
1,1, NULL, 3)