Regex to get amount from sentence not working for one scenario

Time:11-11

I am new to regex. I need to extract the amounts from this sentence:

The watches are INR 2,550 Only Kidswear under INR 399.59 Only Cricket bat INR590 Only

I have created a regex that extracts the first two amounts. I tried to make it match the third one as well, but it still does not work. Can someone please help?

My Regex - (?i)\\b(\\d+(?:[.,]\\d+)?)

CodePudding user response:

The word boundary at the beginning prevents the 590 in INR590 from matching: R and 5 are both word characters, so there is no word boundary between them. But if you omit that word boundary, you would match digits in a lot more places.

For the example string, you could make the pattern a bit more specific instead and add a word boundary at the end of the pattern.

(?i)\bINR\h*(\d+(?:[.,]\d+)?)\b


In Java:

String regex = "(?i)\\bINR\\h*(\\d+(?:[.,]\\d+)?)\\b";
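As a minimal sketch of how that pattern might be used, assuming the amount is read from capture group 1 (the class and method names here are hypothetical, not part of the answer):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class InrAmountExtractor {
    // Case-insensitive "INR", optional horizontal whitespace, then digits
    // with an optional "." or "," part captured in group 1.
    private static final Pattern INR_AMOUNT =
            Pattern.compile("(?i)\\bINR\\h*(\\d+(?:[.,]\\d+)?)\\b");

    // Returns every amount that follows "INR" in the text, in order.
    static List<String> extract(String text) {
        List<String> amounts = new ArrayList<>();
        Matcher m = INR_AMOUNT.matcher(text);
        while (m.find()) {
            amounts.add(m.group(1));
        }
        return amounts;
    }

    public static void main(String[] args) {
        String s = "The watches are INR 2,550 Only Kidswear under INR 399.59 Only "
                + "Cricket bat INR590 Only";
        // Prints 2,550 then 399.59 then 590, one per line.
        extract(s).forEach(System.out::println);
    }
}
```

The amounts come back as strings. Turning them into numbers would still need a decision on whether "," is a thousands separator or a decimal mark, which the pattern itself does not settle.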

You could also for example assert that what is directly to the left is either a space or another allowed character:

(?<=\h|[A-Z])\d+(?:[.,]\d+)?\b

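A rough Java sketch of the lookbehind variant, for comparison. The class name is made up for the example, and the whole match is used because this pattern has no capture group:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class LookbehindAmounts {
    // Digits are accepted only when the character directly to the left is
    // horizontal whitespace or an uppercase letter A-Z.
    private static final Pattern AMOUNT =
            Pattern.compile("(?<=\\h|[A-Z])\\d+(?:[.,]\\d+)?\\b");

    static List<String> extract(String text) {
        List<String> amounts = new ArrayList<>();
        Matcher m = AMOUNT.matcher(text);
        while (m.find()) {
            amounts.add(m.group()); // whole match, no capture group needed
        }
        return amounts;
    }

    public static void main(String[] args) {
        String s = "The watches are INR 2,550 Only Kidswear under INR 399.59 Only "
                + "Cricket bat INR590 Only";
        // Prints 2,550 then 399.59 then 590, one per line.
        extract(s).forEach(System.out::println);
    }
}
```

Because this version does not require INR, it also picks up unrelated numbers: for "Delivery in 3 days for INR 10" it returns both 3 and 10. The INR-specific pattern is likely the safer choice when the text contains other numbers.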

CodePudding user response:

/\s*INR\s*(\d+[,.]{0,1}\d+)/g

\s* matches any number of whitespace characters.

INR matches the literal "INR".

\d+ matches one or more digits.

[,.]{0,1}\d+ matches an optional comma or point, followed by one or more digits.

(\d+[,.]{0,1}\d+) matches a number such as "10", "10.1" or "10,1". Because it is in parentheses, it is a capture group, so you can retrieve exactly the text it matched.

For larger numbers written like 1.000.000,00, replace (\d+[,.]{0,1}\d+) with (\d+(\.\d+)*([,.]{0,1}\d+))
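Since the question is about Java, here is a sketch of the extended pattern translated to Java, with the grouping dot escaped as \. so that it matches only a literal point. The class name is hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class GroupedAmounts {
    // "INR", optional whitespace, then digits, any number of ".ddd" style
    // groups, and a final optional "," or "." followed by digits.
    private static final Pattern AMOUNT =
            Pattern.compile("\\s*INR\\s*(\\d+(\\.\\d+)*([,.]{0,1}\\d+))");

    static List<String> extract(String text) {
        List<String> amounts = new ArrayList<>();
        Matcher m = AMOUNT.matcher(text);
        while (m.find()) {
            amounts.add(m.group(1)); // group 1 is the full amount
        }
        return amounts;
    }

    public static void main(String[] args) {
        // Prints 1.000.000,00
        extract("Total INR 1.000.000,00").forEach(System.out::println);
    }
}
```

One thing to be aware of: both the short and the extended pattern end in \d+ after an initial \d+, so they need at least two digits. A single-digit amount such as "INR 5" is not matched.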

CodePudding user response:

The following regular expression matches INR amounts written in the conventional format: groups of three digits separated by commas, an optional two-digit decimal part and no leading zero.

(?<=\bINR ?)(?:[1-9]\d{1,2}|\d)(?:,\d{3})*(?:\.\d{2})?(?![\d.,])

A Java 8 regex demo was run against the following strings.

The watches are INR 2,550 Kidswear under INR 2,399.59 Cricket bat INR590
                    ^^^^^                    ^^^^^^^^                ^^^
Cement mixers are INR 25,34,128 B767 are INR 23,401,798,261.35
                                             ^^^^^^^^^^^^^^^^^
Cough drops are INR 3.241 Bubble gum is INR 01.23

The four matches are indicated by the party hats (^). 25,34,128 was rejected because its commas are not each followed by exactly three digits, 3.241 was passed over because it has other than two digits to the right of the decimal point, and 01.23 failed the cut because of the leading zero.
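For anyone who wants to reproduce this locally, a small Java sketch along these lines should give the same four matches. The class and method names are invented for the example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class StrictInrAmounts {
    // The strict pattern from this answer, with backslashes doubled for Java.
    private static final Pattern AMOUNT = Pattern.compile(
            "(?<=\\bINR ?)(?:[1-9]\\d{1,2}|\\d)(?:,\\d{3})*(?:\\.\\d{2})?(?![\\d.,])");

    static List<String> extract(String text) {
        List<String> amounts = new ArrayList<>();
        Matcher m = AMOUNT.matcher(text);
        while (m.find()) {
            amounts.add(m.group()); // the lookbehind keeps "INR" out of the match
        }
        return amounts;
    }

    public static void main(String[] args) {
        String[] lines = {
            "The watches are INR 2,550 Kidswear under INR 2,399.59 Cricket bat INR590",
            "Cement mixers are INR 25,34,128 B767 are INR 23,401,798,261.35",
            "Cough drops are INR 3.241 Bubble gum is INR 01.23"
        };
        for (String line : lines) {
            // One amount per output line; the third string yields nothing.
            extract(line).forEach(System.out::println);
        }
    }
}
```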

The regular expression can be broken down as follows.

(?<=         # begin positive lookbehind
  \bINR ?    # match a word boundary, 'INR', then an optional space
)            # end positive lookbehind
(?:          # begin non-capture group
  [1-9]      # match a digit other than 0
  \d{1,2}    # match 1 or 2 digits
  |          # or
  \d         # match 1 digit
)            # end non-capture group
(?:          # begin non-capture group
  ,\d{3}     # match ',' followed by 3 digits
)*           # end non-capture group and execute it 0 or more times
(?:          # begin non-capture group
  \.\d{2}    # match '.' then  2 digits
)?           # end non-capture group and execute it 0 or 1 times
(?!          # begin negative lookahead
  [\d.,]     # match a digit, '.' or ','
)            # end negative lookahead

I don't know Java but was a bit surprised to find that the lookbehind ((?<=\bINR ?)) could contain an optional character (a space). If a version of Java is used that does not support that, the lookbehind could be replaced with the following:

(?:(?<=\bINR )|(?<=\bINR))