I want to create a regex to match numbers.
These numbers may:
- be integers or decimals
- follow the American convention (commas as thousand separators, dot as decimal marker), the European convention (the opposite), or use spaces as thousand separators.
The regex may have to match numbers within sentences, with alphanumerical characters before and after (13.4% - 13.7%, less than 123 000nmol/L, etc.).
The match will combine:
NAME | VALUE |
---|---|
NUMBER | [0-9] |
SYMBOL | [,. ] (dot, comma and space) |
NAME | VALUE | HOW TO DETERMINE IT |
---|---|---|
DECIMAL_MARKER | , OR . | -> MUST be only one -> MUST be the last SYMBOL on the right (no other SYMBOL after it) -> MUST be different from THOUSAND_SEPARATOR |
THOUSAND_SEPARATOR | , OR . OR space | -> CAN be one or more -> MUST be the very first SYMBOL on the left (no other SYMBOL before it) -> MUST be followed by a group of 3 NUMBER -> MUST be different from DECIMAL_MARKER |
There is an ambiguity in one case: 1.123 or 1,123 => there is no way to know for sure whether the SYMBOL is a THOUSAND_SEPARATOR or a DECIMAL_MARKER. The regex must match it anyway.
What the regex should match as one number:
3
13
330,1
3.1021
12 300
1,000,000
20.000.000
1,044.12
1 044,120.12
What the regex should NOT match as one number:
- RULE 1 → No SYMBOL next to each other:
3..4
3,,4
3 4
10.,2
10.,.2
10 ,2
10..... 2
- RULE 2 → No SYMBOL after DECIMAL_MARKER:
12,300.3 3
12 300.230,12.3
- RULE 3 → 3 numbers after THOUSAND_SEPARATOR:
1 3
200 10
200 1232
10 111 22
12,12.23
- RULE 4 → No multiple SYMBOL for THOUSAND_SEPARATOR. Stick to one:
1 123,123 123
- RULE 5 → No DECIMAL_MARKER at the beginning of the NUMBER. The regex must match 40 here, and not 0.40:
.40
,40
I tried a regex like this one: \d+( \d{3})+|\d+[,.]\d+|\d+\d+( \d{3})+|\d+
(https://regex101.com/r/aZ6Wax/1) but it doesn't cover all the cases, especially the conflicts that may exist between commas and dots.
I think there are too many constraints; maybe I should split this regex into several?
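One possible way to split the problem, as a rough sketch only: write one small sub-pattern per separator convention and join them with alternation. The names AMERICAN, EUROPEAN, SPACED and is_single_number are hypothetical, and this version is stricter than the examples above: it rejects 1 044,120.12 because that string mixes two different thousand separators. It also assumes the whole string is the candidate number.

```python
import re

# Hypothetical sub-patterns, one per separator convention (names are illustrative).
AMERICAN = r"\d{1,3}(?:,\d{3})*(?:\.\d+)?"   # e.g. 1,044.12
EUROPEAN = r"\d{1,3}(?:\.\d{3})*(?:,\d+)?"   # e.g. 20.000.000 or 330,1
SPACED = r"\d{1,3}(?: \d{3})*(?:[.,]\d+)?"   # e.g. 12 300

# Each alternative is wrapped in a non-capturing group before joining.
NUMBER = re.compile("|".join(f"(?:{p})" for p in (AMERICAN, EUROPEAN, SPACED)))


def is_single_number(text: str) -> bool:
    """Return True when the whole string is one number in one of the three styles."""
    return NUMBER.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ["1,044.12", "20.000.000", "12 300", "12,12.23", ".40"]:
        print(repr(sample), is_single_number(sample))
```

The ambiguous cases 1.123 and 1,123 are both accepted, since each is valid under at least one convention.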
CodePudding user response:
For your example data, you might use:
^\d{1,3}(?:[,. ]\d{3}(?:([. ,])\d{3}(?:\1\d{3})*)?)?(?:(?!\1)[.,]\d+)?$
The ([. ,]) captures the delimiter at the start of the second repetition of 3 digits, and the backreference \1 then matches that same character in later repetitions.
The negative lookahead in (?!\1)[.,] makes the decimal marker match a dot or a comma that differs from the captured delimiter.
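To check this pattern against the examples from the question, here is a small Python sketch. It assumes the decimal part is written as [.,]\d+ (one or more digits), and the helper name is_number is hypothetical. It also relies on a backreference to an unset group failing to match, which is how Python's re module behaves; in JavaScript an unset backreference matches the empty string, so (?!\1) would behave differently there.

```python
import re

# Pattern from this answer; \d+ in the decimal part is an assumption.
NUMBER = re.compile(
    r"^\d{1,3}(?:[,. ]\d{3}(?:([. ,])\d{3}(?:\1\d{3})*)?)?(?:(?!\1)[.,]\d+)?$"
)

# Examples taken from the question.
SHOULD_MATCH = ["3", "13", "330,1", "3.1021", "12 300", "1,000,000",
                "20.000.000", "1,044.12", "1 044,120.12"]
SHOULD_NOT_MATCH = ["3..4", "3,,4", "3 4", "10.,2", "10.,.2", "10 ,2", "10..... 2",
                    "12,300.3 3", "12 300.230,12.3", "1 3", "200 10", "200 1232",
                    "10 111 22", "12,12.23", "1 123,123 123", ".40", ",40"]


def is_number(text: str) -> bool:
    """Return True when the whole string matches the pattern."""
    return NUMBER.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in SHOULD_MATCH + SHOULD_NOT_MATCH:
        print(repr(sample), is_number(sample))
```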
CodePudding user response:
Yet another option:
^\d{1,3}( \d{3})?(([\., ])\d{3}(\3\d{3})*)?([\.,]\d+)?$
This regex works by matching:
- ^ : start of string
- \d{1,3} : a set of always present digits
- ( \d{3})? : an optional thousands group with a space followed by 3 digits
- (([\., ])\d{3}(\3\d{3})*)? : the thousands group
  - ([\., ])\d{3} : the first thousand part, composed of a delimiter and three digits
  - (\3\d{3})* : other optional thousand parts, composed of the same previous delimiter (\3 refers back to the delimiter captured by the third group) and three digits
- ([\.,]\d+)? : the decimals group, composed of either a comma or a dot and a sequence of digits
- $ : end of string
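Since the question also asks for numbers inside sentences, the ^ and $ anchors can be swapped for lookarounds. The following is only a sketch adapted from the pattern above, not the answer's own regex: the helper groups are made non-capturing so the delimiter becomes group 1, and the names NUMBER_IN_TEXT and find_numbers are hypothetical. The lookarounds are an assumption about what "not part of a longer number" should mean.

```python
import re

NUMBER_IN_TEXT = re.compile(
    r"(?<!\d)(?<!\d[.,])"              # not glued to a digit, or to a digit plus separator
    r"\d{1,3}"                         # leading digits
    r"(?: \d{3})?"                     # optional space-separated thousands group
    r"(?:([., ])\d{3}(?:\1\d{3})*)?"   # thousands groups sharing one delimiter
    r"(?:[.,]\d+)?"                    # optional decimals
    r"(?![.,]?\d)"                     # not followed by more digits of the same number
)


def find_numbers(text: str) -> list[str]:
    """Return every number-like substring found in the text."""
    return [m.group(0) for m in NUMBER_IN_TEXT.finditer(text)]


if __name__ == "__main__":
    print(find_numbers("13.4% - 13.7%"))            # ['13.4', '13.7']
    print(find_numbers("less than 123 000nmol/L"))  # ['123 000']
    print(find_numbers("It costs 1,000,000."))      # ['1,000,000']
```

Like the anchored pattern, this skips integers longer than three digits that have no separator (for example 1234), because the leading part is limited to \d{1,3}.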