Nice regex for cleaning up space-separated digits in Python

Disclaimer - This is not a homework question. Normally I wouldn't ask something so simple, but I can't find an elegant solution.

What I am trying to achieve -

Input from OCR: "0 1 loren ipsum 1 2 3 dolor sit 4 5 6 amet -7 8 9 1- date 13 06 2022"
Parsed output: "01 loren ipsum 123 dolor sit 456 amet -7891- date 13 06 2022"

Essentially, I want to remove the spaces between digits. The caveat is that the digits being joined must each be a single character long (to avoid mangling things like dates). For a date like "1 1 1970", though, it's fine if it gets converted to "11 1970", since that doesn't violate the single-character principle.

The most decent regex I could think of was (.*?\D)\d( \d)+ . However, this doesn't work for numbers at the beginning of the string. Search and replace is also fairly complicated with this regex (I can't do a re.subn with it).

Can anyone think of an elegant Python-based solution (preferably using regex) to achieve this?

CodePudding user response：

Perhaps you can capture the date-like format (so it is kept as-is) or the digits, and match the 1+ whitespace chars in between digits so that they get removed.

In the replacement use the capture groups.

\b(\d{1,2}\s+\d{1,2}\s+\d{4})\b|(\d+)\s+(?!\D|\d{1,2}\s+\d{1,2}\s+\d{4}\b)

The pattern matches:

\b A word boundary to prevent a partial word match
(\d{1,2}\s+\d{1,2}\s+\d{4})\b Capture a date-like pattern in group 1
| Or
(\d+) Capture group 2, match 1+ digits
\s+ Match 1+ whitespace chars (that will be removed)
(?! Negative lookahead, assert what is directly to the right of the current position is not
- \D Match a non-digit
- | Or
- \d{1,2}\s+\d{1,2}\s+\d{4}\b Match the date-like pattern
) Close the negative lookahead
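
As a rough sketch, the same pattern can also be written with re.VERBOSE so that each piece of the breakdown above sits next to the part it describes. The name `pattern` and the sample strings are only for illustration.

```python
import re

# Same regex as above, spread out with re.VERBOSE so each part carries a comment.
# In verbose mode, unescaped whitespace and "#" comments inside the pattern are ignored.
pattern = re.compile(
    r"""
    \b(\d{1,2}\s+\d{1,2}\s+\d{4})\b   # group 1: a date-like run, kept as-is
    |                                 # or
    (\d+)                             # group 2: 1+ digits
    \s+                               # 1+ whitespace chars (dropped by the replacement)
    (?!                               # not followed by
        \D                            #   a non-digit
        |                             #   or
        \d{1,2}\s+\d{1,2}\s+\d{4}\b   #   the date-like pattern
    )
    """,
    re.VERBOSE,
)

# An unmatched group in the replacement is treated as empty on Python 3.5+.
print(pattern.sub(r"\1\2", "1 2 3 dolor"))      # 123 dolor
print(pattern.sub(r"\1\2", "date 13 06 2022"))  # date 13 06 2022
```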

import re

pattern = r"(\b\d{1,2}\s+\d{1,2}\s+\d{4})\b|(\d+)\s+(?!\D|\d{1,2}\s+\d{1,2}\s+\d{4}\b)"
s = "0 1 loren ipsum 1 2 3 dolor sit 4 5 6 amet -7 8 9 1- date 13 06 2022"
result = re.sub(pattern, r"\1\2", s)

if result:
    print(result)

Output

01 loren ipsum 123 dolor sit 456 amet -7891- date 13 06 2022

CodePudding user response：

>>> import re
>>> regex = re.compile(r"(?<=\b\d)\s+(?=\d\b)")
>>> regex.sub("", "0 1 loren ipsum 1 2 3 dolor sit 4 5 6 amet -7 8 9 1- date 13 06 2022")
'01 loren ipsum 123 dolor sit 456 amet -7891- date 13 06 2022'

Explanation:

(?<=\b\d) # Assert that a single digit precedes the current position
\s+       # Match one (or more) whitespace character(s)
(?=\d\b)  # Assert that a single digit follows the current position

The sub() operation removes all whitespace that matches this rule.
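
The two answers agree on the sample input but can differ on edge cases. Below is a quick sketch for comparing them; the helper names and the two extra test strings are made up for illustration and are not from the question.

```python
import re

# Pattern from the first answer: keeps date-like runs, joins other digit runs.
DATE_AWARE = re.compile(
    r"\b(\d{1,2}\s+\d{1,2}\s+\d{4})\b|(\d+)\s+(?!\D|\d{1,2}\s+\d{1,2}\s+\d{4}\b)"
)
# Pattern from the second answer: joins only isolated single digits.
SINGLE_DIGITS = re.compile(r"(?<=\b\d)\s+(?=\d\b)")


def join_date_aware(text):
    # Relies on Python 3.5+, where an unmatched group substitutes as "".
    return DATE_AWARE.sub(r"\1\2", text)


def join_single_digits(text):
    return SINGLE_DIGITS.sub("", text)


for sample in ["1 1 1970", "12 34 abc"]:
    print(repr(sample), "->",
          repr(join_date_aware(sample)), "|", repr(join_single_digits(sample)))
# '1 1 1970' -> '1 1 1970' | '11 1970'
# '12 34 abc' -> '1234 abc' | '12 34 abc'
```

So the lookaround version seems to follow the single-character rule from the question more strictly, while the date-aware one leaves "1 1 1970" alone and also merges multi-digit numbers that are not part of a date-like run.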