Regex to match all numbers, integers and decimals, without knowing the numeric convention

I want to create a regex to match numbers.

These numbers may:

be integers or decimals
follow the American convention (commas as thousands separators and a dot as decimal marker), the European convention (the opposite), or even use spaces as thousands separators.

The regex may have to match numbers inside sentences, with alphanumeric characters before and after (13.4% - 13.7%, less than 123 000nmol/L, etc.).

The match will combine:

NAME	VALUE
NUMBER	[0-9]
SYMBOL	[,. ] (dot, comma and space)

NAME	VALUE	HOW TO DETERMINE IT
DECIMAL_MARKER	`,` OR `.`	-> MUST be only one -> MUST be the last SYMBOL on the right (no other SYMBOL after it) -> MUST be different from THOUSAND_SEPARATOR
THOUSAND_SEPARATOR	`,` OR `.` OR ` ` (space)	-> CAN be one or more -> MUST be the very first SYMBOL on the left (no other SYMBOL before it) -> MUST be followed by a group of 3 NUMBER -> MUST be different from DECIMAL_MARKER

There is an ambiguity in one case: with 1.123 or 1,123 there is no way to know for sure whether the SYMBOL is a THOUSAND_SEPARATOR or a DECIMAL_MARKER. The regex must match it anyway.

What the regex should match as one number:

3
13 
330,1
3.1021 
12 300 
1,000,000 
20.000.000 
1,044.12
1 044,120.12

What the regex should NOT match as one number:

RULE 1 → No SYMBOL next to each other:

3..4
3,,4
3  4
10.,2
10.,.2
10 ,2
10..... 2

RULE 2 → No SYMBOL after DECIMAL_MARKER:

12,300.3 3
12 300.230,12.3

RULE 3 → 3 numbers after THOUSAND_SEPARATOR:

RULE 4 → No multiple SYMBOL for THOUSAND_SEPARATOR. Stick to one:

1 123,123 123

RULE 5 → No DECIMAL_MARKER at the beginning of the NUMBER. Here the regex must match 40, not 0.40:

.40
,40

I tried a regex like this one: \d+( \d{3})+|\d+[,.]\d+|\d+\d+( \d{3})+|\d+ (https://regex101.com/r/aZ6Wax/1), but it does not cover all the cases, especially the conflicts that may exist between commas and dots. I think there are too many constraints; maybe I should split this regex into several?

CodePudding user response：

For your example data, you might use:

^\d{1,3}(?:[,. ]\d{3}(?:([. ,])\d{3}(?:\1\d{3})*)?)?(?:(?!\1)[.,]\d+)?$

The capture group ([. ,]) records the delimiter that starts the second group of 3 digits, and the backreference \1 requires every further group to use that same character.

The negative lookahead in (?!\1)[.,] then accepts a dot or comma as the decimal marker only if it differs from that captured delimiter.
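As a quick check, the pattern can be run against the examples from the question. The following is a minimal sketch using Python's re module. The names CORE, WHOLE, IN_TEXT, is_number and find_numbers are made up for illustration. The unanchored IN_TEXT variant, which swaps ^ and $ for digit lookarounds so numbers can be found inside a sentence, is an assumption and not part of the pattern given here.

```python
import re

# Core of the pattern, without the ^ and $ anchors.
CORE = r"\d{1,3}(?:[,. ]\d{3}(?:([. ,])\d{3}(?:\1\d{3})*)?)?(?:(?!\1)[.,]\d+)?"

# Anchored form: the whole string must be one number.
WHOLE = re.compile("^" + CORE + "$")

# Assumption (not from the answer): to scan free text, replace the anchors
# with lookarounds that forbid a digit directly before or after the match.
IN_TEXT = re.compile(r"(?<!\d)" + CORE + r"(?!\d)")


def is_number(text):
    """Return True if the whole string is accepted as a single number."""
    return WHOLE.match(text) is not None


def find_numbers(text):
    """Return every number found inside a longer string."""
    # finditer + group(0): findall would return only the captured delimiter.
    return [m.group(0) for m in IN_TEXT.finditer(text)]


if __name__ == "__main__":
    for sample in ["1,044.12", "1 044,120.12", "3..4", "1 123,123 123"]:
        print(repr(sample), is_number(sample))
    print(find_numbers("13.4% - 13.7%, less than 123 000nmol/L"))
    # -> ['13.4', '13.7', '123 000']
```

With the lookaround variant, inputs such as .40 yield 40 on its own, which is what RULE 5 asks for. Note that the pattern expects at most three leading digits, so an unseparated integer like 1234 is not matched.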


CodePudding user response：

Yet another option:

^\d{1,3}( \d{3})?(([\., ])\d{3}(\3\d{3})*)?([\.,]\d+)?$

This regex matches, in order:

^: start of string
\d{1,3}: one to three leading digits, always present
( \d{3})?: an optional thousands group made of a space followed by 3 digits
(([\., ])\d{3}(\3\d{3})*)?: the remaining thousands groups
- ([\., ])\d{3}: the first of these groups, a delimiter followed by three digits
- (\3\d{3})*: any further groups, each the same delimiter (backreference \3) followed by three digits
([\.,]\d+)?: the optional decimal part, a comma or a dot followed by one or more digits
$: end of string
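The breakdown above can be written directly into the pattern with Python's re.VERBOSE flag. This is only a sketch: NUMBER_RE and parse are hypothetical names, and the backreference is written as \3 because, counting opening parentheses from the left, the delimiter is the third capture group (Python refuses a backreference to a group that is still open).

```python
import re

# In verbose mode bare whitespace is ignored, so the literal space is
# written as "[ ]"; a space inside a character class is kept as-is.
NUMBER_RE = re.compile(
    r"""
    ^
    \d{1,3}              # one to three leading digits, always present
    ([ ]\d{3})?          # group 1: optional space-separated thousands group
    (                    # group 2: remaining thousands groups
        ([., ])\d{3}     #   group 3: delimiter, followed by three digits
        (\3\d{3})*       #   group 4: same delimiter + three digits, repeated
    )?
    ([.,]\d+)?           # group 5: decimal marker and decimal digits
    $
    """,
    re.VERBOSE,
)


def parse(text):
    """Return (thousands delimiter, decimal part) on a match, else None."""
    m = NUMBER_RE.match(text)
    if m is None:
        return None
    return m.group(3), m.group(5)


if __name__ == "__main__":
    print(parse("1 044,120.12"))   # -> (',', '.12')
    print(parse("20.000.000"))     # -> ('.', None)
    print(parse("1 123,123 123"))  # -> None
```

One thing to keep in mind is that this pattern does not compare the decimal marker with the thousands delimiter, so it is more permissive on that rule than the question's table.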
