Nice regex for cleaning up space separated digits in Python

Time:07-13

Disclaimer - This is not a homework question. Normally I wouldn't ask something so simple, but I can't find an elegant solution.

What I am trying to achieve -

Input from OCR: "0 1 loren ipsum 1 2 3 dolor sit 4 5 6 amet -7 8 9 1- date 13 06 2022"
Parsed output: "01 loren ipsum 123 dolor sit 456 amet -7891- date 13 06 2022"

Essentially, remove the spaces between digits. There is a caveat, though: only digits that stand alone as a single character should be joined (to avoid mangling things like dates). For a date like "1 1 1970", it's fine if it gets converted to "11 1970", since that doesn't violate the single-character principle.

The most decent regex I could think of was (.*?\D)\d( \d) . However this doesn't work for numbers at the beginning of the string. Also search and replace is fairly complicated with this regex (I can't do a re.subn with this).

Can anyone think of an elegant Python based solution (preferably using regex) to achieve this?

CodePudding user response:

Perhaps you can capture and keep either the date-like format or the digits, and match the 1+ whitespace chars in between digits so that they are removed.

In the replacement use the capture groups.

\b(\d{1,2}\s+\d{1,2}\s+\d{4})\b|(\d+)\s+(?!\D|\d{1,2}\s+\d{1,2}\s+\d{4}\b)

The pattern matches:

  • \b A word boundary to prevent a partial word match
  • (\d{1,2}\s+\d{1,2}\s+\d{4})\b Capture a date-like pattern in group 1
  • | Or
  • (\d+) Capture group 2, match 1+ digits
  • \s+ Match 1+ whitespace chars (that will be removed)
  • (?! Negative lookahead, assert that what is directly to the right of the current position is not
    • \D A non-digit
    • | Or
    • \d{1,2}\s+\d{1,2}\s+\d{4}\b The date-like pattern
  • ) Close the negative lookahead


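import re

# Group 1 keeps a date-like run intact; group 2 keeps the digits and drops the whitespace after them.
pattern = r"(\b\d{1,2}\s+\d{1,2}\s+\d{4})\b|(\d+)\s+(?!\D|\d{1,2}\s+\d{1,2}\s+\d{4}\b)"
s = "0 1 loren ipsum 1 2 3 dolor sit 4 5 6 amet -7 8 9 1- date 13 06 2022"
# Whichever group did not take part in the match expands to an empty string (Python 3.5+).
result = re.sub(pattern, r"\1\2", s)

if result:
    print(result)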

Output

01 loren ipsum 123 dolor sit 456 amet -7891- date 13 06 2022
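
The r"\1\2" replacement works because the group that did not participate in a match expands to an empty string (Python 3.5 and later). As a hedged alternative, the same pattern can be written with named groups and a replacement function, which makes the "keep the date, otherwise keep only the digits" decision explicit. The names DATE, PATTERN, keep_date_or_join and clean are made up for this sketch and are not part of the answer above.

```python
import re

# Date-like run: day, month and four-digit year separated by whitespace.
DATE = r"\d{1,2}\s+\d{1,2}\s+\d{4}"

# Same alternation as the answer's pattern, with named groups (hypothetical names).
PATTERN = re.compile(
    rf"\b(?P<date>{DATE})\b|(?P<digits>\d+)\s+(?!\D|{DATE}\b)"
)


def keep_date_or_join(match: re.Match) -> str:
    """Return a date-like match unchanged; otherwise return the digits
    without the whitespace that followed them."""
    date = match.group("date")
    return date if date is not None else match.group("digits")


def clean(text: str) -> str:
    return PATTERN.sub(keep_date_or_join, text)


s = "0 1 loren ipsum 1 2 3 dolor sit 4 5 6 amet -7 8 9 1- date 13 06 2022"
print(clean(s))
# 01 loren ipsum 123 dolor sit 456 amet -7891- date 13 06 2022
```

Note that this pattern treats "1 1 1970" as a date and leaves it untouched, whereas the question would also accept "11 1970".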

CodePudding user response:

>>> import re
>>> regex = re.compile(r"(?<=\b\d)\s+(?=\d\b)")
>>> regex.sub("", "0 1 loren ipsum 1 2 3 dolor sit 4 5 6 amet -7 8 9 1- date 13 06 2022")
'01 loren ipsum 123 dolor sit 456 amet -7891- date 13 06 2022'

Explanation:

(?<=\b\d) # Assert that a single digit precedes the current position
\s+       # Match one (or more) whitespace character(s)
(?=\d\b)  # Assert that a single digit follows the current position

The sub() operation removes all whitespace that matches this rule.
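
To see how the lookaround rule treats the cases raised in the question, here is a small sketch that runs it on a few extra inputs. The helper name join_single_digits and the sample strings other than "1 1 1970" are assumptions made for illustration, not part of the original answer.

```python
import re

# Whitespace is removed only when it sits between two standalone single digits.
SINGLE_DIGIT_GAP = re.compile(r"(?<=\b\d)\s+(?=\d\b)")


def join_single_digits(text: str) -> str:
    return SINGLE_DIGIT_GAP.sub("", text)


# "1 1 1970": only the first gap has a single digit on both sides.
print(join_single_digits("1 1 1970"))    # 11 1970
# Multi-digit tokens are never joined, so the date survives as is.
print(join_single_digits("13 06 2022"))  # 13 06 2022
# "12" is not a single digit, so only "3 4" is joined.
print(join_single_digits("12 3 4"))      # 12 34
```

Because both conditions are zero-width assertions, each gap is checked against the original text, which is why every space in a run like "1 2 3" is removed in one pass.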
