Regex to match (French) numbers

Time:10-04

I'm trying to find a simple (not perfect) pattern to recognise French numbers in a French text. French numbers use a comma where English uses a decimal point, and a dot or a space as the thousands separator. \u00A0 is the non-breaking space, which is also often used as the thousands separator in French documents.

So my first attempt is:

number_pattern = re.compile(r'\d[\d\., \u00A0]*\d', flags=re.UNICODE)

... but the trouble is that this doesn't then match a single digit.

But if I do this

number_pattern = re.compile(r'\d[\d\., \u00A0]*\d?', flags=re.UNICODE)

it then picks up trailing space (or NBS) characters (or for that matter a trailing comma or full stop).
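
The behaviour of the two attempts can be reproduced with a short sketch. The sample sentence and the variable names are made up for illustration; only the two patterns come from the question.

```python
import re

# First attempt: needs a digit at both ends, so it needs at least two digits.
two_digit_pattern = re.compile(r'\d[\d\., \u00A0]*\d')
# Second attempt: the closing digit is optional, so separators may trail.
optional_end_pattern = re.compile(r'\d[\d\., \u00A0]*\d?')

# Hypothetical sample text (ordinary spaces, no NBS).
sample = "Il a 3 chats et 1 250,50 euros."

first_matches = two_digit_pattern.findall(sample)
second_matches = optional_end_pattern.findall(sample)

print(first_matches)   # the lone "3" is missing
print(second_matches)  # both numbers are found, each with a trailing space
```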

The thing is, the pattern must both START and END with a digit, but it is possible that these may be the SAME character.

How might I achieve this? I considered a two-stage process where you try to see whether this is in fact a single-digit number... but that in itself is not trivial: if followed by a space, NBS, comma or dot, you then have to see whether the character after that, if there is one, is or is not a digit.

Obviously I'm hoping to find a solution which involves only one regex: if there is only one regex, it is then possible to do something like:

doubled_dollars_plain_text = plain_text.replace('$', '$$')
substituted_plain_text = re.sub(number_pattern, '$number', doubled_dollars_plain_text)

... having to use a two-stage process would make this much more lengthy and fiddly.

Edit

I tried to see whether I could implement ThierryLathuille's idea, so I tried:

re.compile(r'(\d(?:[\d\., \u00A0]*\d)?)', flags=re.UNICODE)

... this seems to work pretty well. Unlike JvdV's solution, it doesn't check that thousand separators are followed by 3 digits. It would also accept a succession of commas and spaces in the middle, which is quite problematic when you have a list of numbers separated by ", ". But it's acceptable for certain purposes... until something more sophisticated can be found.
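
A quick check of this pattern on a made-up sentence shows both the improvement (a single digit now matches) and the weakness with ", "-separated lists. The sample text and variable names are illustrative only.

```python
import re

# Pattern from the edit above: a digit, optionally followed by any run of
# digits/separators that ends in a digit.
edit_pattern = re.compile(r'(\d(?:[\d\., \u00A0]*\d)?)')

# Hypothetical sample: one lone digit and one comma-separated list.
sample = "Un chat, 3 chiens et les valeurs 1, 2, 3."

matches = edit_pattern.findall(sample)
print(matches)  # the list "1, 2, 3" is swallowed as a single "number"
```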

I wonder if there's a way of saying "any non-digit in this pattern must be on its own" (i.e. must be bracketed between two digits)?
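
One possible way to express "every non-digit must sit between two digits" is to alternate runs of digits with exactly one separator character. This is only a sketch: the name strict_separator_pattern is hypothetical, and the pattern still does not check that thousand groups have 3 digits, so "1,2" would count as one number.

```python
import re

# Digits, then zero or more (single separator + digits) repetitions.
# A separator is only consumed when a digit follows it immediately.
strict_separator_pattern = re.compile(r'\d+(?:[., \u00A0]\d+)*')

# Hypothetical sample: a ", "-separated list plus a formatted number.
sample = "Valeurs : 1, 2, 3 et 1 250,50 ou 7."

matches = strict_separator_pattern.findall(sample)
print(matches)
```

Because ", " is two separator characters in a row, the list items are now matched individually, and the final full stop after "7" is left out.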

CodePudding user response:

What about:

\d{1,3}(?:[\s.]?\d{3})*(?:,\d+)?(?!\d)

  • \d{1,3} - 1-3 digits.
  • (?: - Open 1st non-capture group:
    • [\s.]? - An optional whitespace character or literal dot. With Unicode matching, \s should cover the non-breaking space; in Python 3 it does so by default for str patterns, though it also matches tabs and newlines.
    • \d{3} - Three digits.
    • )* - Close 1st non-capture group and match 0+ times.
  • (?:,\d+)? - A 2nd optional non-capture group to match a comma followed by at least 1 digit.
  • (?!\d) - A negative lookahead to prevent trailing digits.
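
The pattern can be tried in Python as follows. This is a sketch that assumes the quantifier after `,\d` is `+`, as the "at least 1 digit" explanation indicates; the sample text and the name jvdv_pattern are made up for illustration.

```python
import re

# 1-3 digits, optional groups of (separator? + 3 digits), optional decimals,
# and no digit allowed immediately after the match.
jvdv_pattern = re.compile(r'\d{1,3}(?:[\s.]?\d{3})*(?:,\d+)?(?!\d)')

# Hypothetical sample with space, dot and non-breaking-space separators.
sample = "Prix : 1 234 567,89 ou 12.500 ou 7, et 1\u00A0000."

matches = jvdv_pattern.findall(sample)
print(matches)
```

Since the separator is optional, an unseparated number such as "2024" is also matched as a whole.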

CodePudding user response:

Very much inspired by JvdV's answer, I suggest this:

number_pattern = re.compile(r'(\d{1,3}(?:(?:[. \u00A0])?\d{3})*(?:,\d+)?(?!\d))', flags=re.UNICODE)

... this makes the thousand separator optional, and also makes thousand groups optional. It restricts the thousand-separator to 3 possible characters: dot, space and NBS, which is necessary for French numbers as found in practice.
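
Put together with the one-step substitution from the question, this could look like the sketch below. It assumes the decimal part is `,\d+`; the sample sentence is made up for illustration.

```python
import re

number_pattern = re.compile(
    r'(\d{1,3}(?:(?:[. \u00A0])?\d{3})*(?:,\d+)?(?!\d))',
    flags=re.UNICODE,
)

# Hypothetical sample text containing a literal dollar sign.
plain_text = "Le total est de 1 250,50 $ pour 3 articles."

# Double any existing dollar signs first, then replace each number.
doubled_dollars_plain_text = plain_text.replace('$', '$$')
substituted_plain_text = re.sub(number_pattern, '$number', doubled_dollars_plain_text)

print(substituted_plain_text)
```

Note that `$` has no special meaning in a Python re.sub replacement string, so '$number' is inserted literally.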