I'm trying to extract 'valid' numbers from text that may or may not contain thousands or millions separators and decimals. The problem is that the separator is sometimes ',' and sometimes '.', and the same applies to the decimal mark. In addition to the \d{3} condition, I should check whether a ',' or '.' occurs later in the number in order to detect automatically whether a character is a decimal or a thousands separator.
Another problem I have found is that the text contains dates in the format 'dd.mm.yyyy' or 'mm.dd.yy', which must not be matched.
The goal is to convert 'valid' numbers to float: I need to make sure the match is not a date, then remove the millions/thousands separators, and finally replace ',' with '.' when the decimal separator is ','.
I have read other great answers, such as "Regular expression to match numbers with or without commas and decimals in text", which solve more specific problems. I would be happy with something robust (it doesn't need to be a single regex).
Here's what I've tried so far but the problem is well above my regex skills:
import re
p = r'\d+(?:[,.]\d{3})*(?:[.,]\d*)'
for s in ['blabla 1,25 10.587.256,25 euros', '6.010,12', '6.010', '6,010', '6,010.12', '6010,124', '05.12.2018', '12.05.18']:
    print(s, re.findall(p, s, re.IGNORECASE))
CodePudding user response:
You can use
import re
p = r'\b\d{1,2}\.\d{1,2}\.\d{2}(?:\d{2})?\b|\b(?<!\d[.,])(\d{1,3}(?=([.,])?)(?:\2\d{3})*|\d+)(?:(?(2)(?!\2))[.,](\d+))?\b(?![,.]\d)'
def postprocess(x):
    if x.group(3):  # integer part followed by fractional digits
        return f"{x.group(1).replace(',','').replace('.','')}.{x.group(3)}"
    elif x.group(2):  # integer part with thousands separators only
        return f"{x.group(1).replace(',','').replace('.','')}"
    else:  # e.g. a date matched by the first alternative
        return None
texts = ['blabla 1,25 10.587.256,25 euros', '6.010,12', '6.010', '6,010', '6,010.12', '6010,124', '05.12.2018', '12.05.18']
for s in texts:
    print(s, '=>', list(filter(None, [postprocess(x) for x in re.finditer(p, s)])))
Output:
blabla 1,25 10.587.256,25 euros => ['1.25', '10587256.25']
6.010,12 => ['6010.12']
6.010 => ['6010']
6,010 => ['6010']
6,010.12 => ['6010.12']
6010,124 => ['6010.124']
05.12.2018 => []
12.05.18 => []
The regex is
\b\d{1,2}\.\d{1,2}\.\d{2}(?:\d{2})?\b|\b(?<!\d[.,])(\d{1,3}(?=([.,])?)(?:\2\d{3})*|\d+)(?:(?(2)(?!\2))[.,](\d+))?\b(?![,.]\d)
Details:
\b\d{1,2}\.\d{1,2}\.\d{2}(?:\d{2})?\b| - matches a whole word made of 1-2 digits, a '.', 1-2 digits, a '.', and 2 or 4 digits (this match will be skipped), or
\b - a word boundary
(?<!\d[.,]) - a negative lookbehind failing the match if there is a digit and a '.' or ',' immediately on the left
(\d{1,3}(?=([.,])?)(?:\2\d{3})*|\d+) - Group 1:
    \d{1,3} - one, two or three digits
    (?=([.,])?) - there must be an optional Group 2 capturing a '.' or ',' immediately on the right
    (?:\2\d{3})* - zero or more sequences of Group 2 value and then any three digits
    | - or
    \d+ - one or more digits
(?:(?(2)(?!\2))[.,](\d+))? - an optional sequence of
    (?(2)(?!\2)) - if Group 2 matched, the next char cannot be Group 2 value
    [.,] - a comma or dot
    (\d+) - Group 3: one or more digits
\b - a word boundary
(?![,.]\d) - a negative lookahead failing the match if there is a ',' or '.' and a digit immediately on the right.
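To make the breakdown easier to follow next to the pattern itself, the same regex can be written with re.VERBOSE so that each part carries its own comment. This is only a sketch: the name NUMBER_OR_DATE and the helper groups are illustrative and not part of the answer above.

```python
import re

# The same pattern as above, laid out with re.VERBOSE (whitespace and
# "#" comments inside the pattern are ignored by the regex engine).
NUMBER_OR_DATE = re.compile(r"""
    \b\d{1,2}\.\d{1,2}\.\d{2}(?:\d{2})?\b   # a date like 05.12.2018 or 12.05.18; no group is set
    |
    \b(?<!\d[.,])                           # number start, not preceded by a digit and a separator
    (                                       # Group 1: the integer part
        \d{1,3}(?=([.,])?)                  #   1-3 digits; Group 2 peeks at the following separator
        (?:\2\d{3})*                        #   then blocks of that separator plus three digits
        |
        \d+                                 #   or a plain run of digits
    )
    (?:
        (?(2)(?!\2))                        # the decimal mark must differ from the Group 2 separator
        [.,](\d+)                           # Group 3: the fractional digits
    )?
    \b(?![,.]\d)                            # not followed by a separator and a digit
""", re.VERBOSE)

def groups(text):
    """Illustrative helper: the (Group 1, Group 2, Group 3) tuple of every match."""
    return [m.groups() for m in NUMBER_OR_DATE.finditer(text)]

print(groups('6.010,12'))
print(groups('05.12.2018'))
```

A date match shows up as a tuple of three None values, which is what lets the post-processing step discard it.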
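The postprocess method returns None when neither Group 3 nor Group 2 matched, which is the case for a skipped date. Otherwise it returns the number with the commas and dots removed from the integer part, followed by '.' and the fractional digits when Group 3 matched. Note that an integer without any separator, such as 6010, also leaves Groups 2 and 3 unset and is therefore dropped as well; testing x.group(1) in place of x.group(2) would keep it.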
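Since the stated goal is float values, the matches can be converted directly rather than returned as strings. The following is a minimal sketch under that assumption; extract_floats is a hypothetical helper name, not part of the answer above, and unlike postprocess it tests Group 1, so integers without any separator are kept.

```python
import re

# Same pattern as in the answer above.
P = r'\b\d{1,2}\.\d{1,2}\.\d{2}(?:\d{2})?\b|\b(?<!\d[.,])(\d{1,3}(?=([.,])?)(?:\2\d{3})*|\d+)(?:(?(2)(?!\2))[.,](\d+))?\b(?![,.]\d)'

def extract_floats(text):
    """Hypothetical helper: return every non-date number found in text as a float."""
    values = []
    for m in re.finditer(P, text):
        if m.group(1) is None:
            # The date alternative matched, so there is no number to convert.
            continue
        # Drop thousands/millions separators from the integer part.
        integer = m.group(1).replace(',', '').replace('.', '')
        # Group 3 holds the fractional digits, if any.
        fraction = m.group(3) or '0'
        values.append(float(f"{integer}.{fraction}"))
    return values

for s in ['blabla 1,25 10.587.256,25 euros', '6,010.12', '05.12.2018']:
    print(s, '=>', extract_floats(s))
```

As in the answer, a three-digit block after a single separator (6.010 or 6,010) is read as thousands, so both become 6010.0.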