I'm trying to find a simple (not perfect) pattern to recognise French numbers in a French text. French numbers use comma for the Anglo-Saxon decimal, and use dot or space for the thousand separator. \u00A0
is non-breaking space, also often used in French documents for the thousand separator.
So my first attempt is:
number_pattern = re.compile(r'\d[\d\., \u00A0]*\d', flags=re.UNICODE)
... but the trouble is that this doesn't then match a single digit.
But if I do this
number_pattern = re.compile(r'\d[\d\., \u00A0]*\d?', flags=re.UNICODE)
it then picks up trailing space (or NBS) characters (or for that matter a trailing comma or full stop).
The thing is, the pattern must both START and END with a digit, but it is possible that these may be the SAME character.
How might I achieve this? I considered a two-stage process where you try to see whether this is in fact a single-digit number... but that in itself is not trivial: if followed by a space, NBS, comma or dot, you then have to see whether the character after that, if there is one, is or is not a digit.
Obviously I'm hoping to find a solution which involves only one regex: if there is only one regex, it is then possible to do something like:
doubled_dollars_plain_text = plain_text.replace('$', '$$')
substituted_plain_text = re.sub(number_pattern, '$number', doubled_dollars_plain_text)
... having to use a two-stage process would make this much more lengthy and fiddly.
Edit
I tried to see whether I could implement ThierryLathuille's idea, so I tried:
re.compile(r'(\d(?:[\d\., \u00A0]*\d)?)', flags=re.UNICODE)
... this seems to work pretty well. Unlike JvdV's solution it doesn't attempt to check that thousand separators are followed by 3 digits, and for that matter you could have a succession of commas and spaces in the middle and it would still pass, which is quite problematic when you have a list of numbers separated by ", ". But it's acceptable for certain purposes... until something more sophisticated can be found.
I wonder if there's a way of saying "any non-digit in this pattern must be on its own" (i.e. must be bracketed between two digits)?
CodePudding user response:
What about:
\d{1,3}(?:[\s.]?\d{3})*(?:,\d )?(?!\d)
See an online demo
\d{1,3}
- 1-3 digits.(?:
- Open 1st non-capture group:[\s.]?
- An optional whitespace or literal dot. Note that with unicode\s
should match\p{Z}
to include the non-breaking whitespace.\d{3}
- Three digits.)*
- Close 1st non-capture group and match 0 times.
(?:,\d )?
- A 2nd optional non-capture group to match a comma followed by at least 1 digit.(?!\d)
- A negative lookahead to prevent trailing digits.
CodePudding user response:
Very much inspired by JvdV's answer, I suggest this:
number_pattern = re.compile(r'(\d{1,3}(?:(?:[. \u00A0])?\d{3})*(?:,\d )?(?!\d))', flags=re.UNICODE)
... this makes the thousand separator optional, and also makes thousand groups optional. It restricts the thousand-separator to 3 possible characters: dot, space and NBS, which is necessary for French numbers as found in practice.