Disclaimer - This is not a homework question. Normally I wouldn't ask something so simple, but I can't find an elegant solution.
What I am trying to achieve -
Input from OCR: "0 1 loren ipsum 1 2 3 dolor sit 4 5 6 amet -7 8 9 1- date 13 06 2022"
Parsed output: "01 loren ipsum 123 dolor sit 456 amet -7891- date 13 06 2022"
Essentially, I want to remove the spaces between digits. The caveat is that the preceding digits must be a single character long (to avoid merging things like dates). For a date like "1 1 1970", though, it's fine if it gets converted to "11 1970", since that doesn't violate the single-character principle.
The most decent regex I could think of was (.*?\D)\d( \d). However, this doesn't work for numbers at the beginning of the string. Also, search and replace is fairly complicated with this regex (I can't do a re.subn with this).
Can anyone think of an elegant Python based solution (preferably using regex) to achieve this?
CodePudding user response:
Perhaps you can capture the date-like format (so it is skipped) or the digits, and match the 1+ whitespace chars between digits so they can be removed.
In the replacement, use the capture groups.
\b(\d{1,2}\s+\d{1,2}\s+\d{4})\b|(\d+)\s+(?!\D|\d{1,2}\s+\d{1,2}\s+\d{4}\b)
The pattern matches:
\b # A word boundary to prevent a partial word match
(\d{1,2}\s+\d{1,2}\s+\d{4})\b # Capture a date-like pattern in group 1
| # Or
(\d+) # Capture group 2, match 1+ digits
\s+ # Match 1+ whitespace chars (that will be removed)
(?! # Negative lookahead, assert that what is directly to the right of the current position is not
\D # Match a non-digit
| # Or
\d{1,2}\s+\d{1,2}\s+\d{4}\b # Match the date-like pattern
) # Close the negative lookahead
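import re

# Group 1 keeps date-like runs intact; group 2 keeps the digits while the whitespace after them is dropped
pattern = r"(\b\d{1,2}\s+\d{1,2}\s+\d{4})\b|(\d+)\s+(?!\D|\d{1,2}\s+\d{1,2}\s+\d{4}\b)"
s = "0 1 loren ipsum 1 2 3 dolor sit 4 5 6 amet -7 8 9 1- date 13 06 2022"
# A group that did not take part in the match expands to an empty string (Python 3.5+)
result = re.sub(pattern, r"\1\2", s)
if result:
    print(result)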
Output
01 loren ipsum 123 dolor sit 456 amet -7891- date 13 06 2022
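To make the two branches easier to follow, here is a minimal sketch of the same pattern written with re.VERBOSE and a replacement callback instead of r"\1\2". The names PATTERN and keep_or_join are hypothetical and only used for illustration; the behaviour is intended to match the snippet above.

```python
import re

# Same pattern as above, spread over several lines with re.VERBOSE
PATTERN = re.compile(
    r"""
    \b(\d{1,2}\s+\d{1,2}\s+\d{4})\b        # group 1: date-like run, kept as-is
    |
    (\d+)\s+                               # group 2: digits, then whitespace to drop
    (?!\D|\d{1,2}\s+\d{1,2}\s+\d{4}\b)     # not followed by a non-digit or a date-like run
    """,
    re.VERBOSE,
)


def keep_or_join(match):
    """Hypothetical helper: return the text that should replace one match."""
    if match.group(1) is not None:
        return match.group(1)  # date-like text is returned unchanged
    return match.group(2)      # digits only, so the matched whitespace disappears


s = "0 1 loren ipsum 1 2 3 dolor sit 4 5 6 amet -7 8 9 1- date 13 06 2022"
result = PATTERN.sub(keep_or_join, s)
print(result)  # 01 loren ipsum 123 dolor sit 456 amet -7891- date 13 06 2022

# Group 2 uses \d+, so multi-digit tokens are joined as well
print(PATTERN.sub(keep_or_join, "12 34 foo"))  # 1234 foo
```

The second call suggests that this pattern does not by itself enforce the single-character rule from the question; it only protects runs that look like a date.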
CodePudding user response:
>>> import re
>>> regex = re.compile(r"(?<=\b\d)\s+(?=\d\b)")
>>> regex.sub("", "0 1 loren ipsum 1 2 3 dolor sit 4 5 6 amet -7 8 9 1- date 13 06 2022")
'01 loren ipsum 123 dolor sit 456 amet -7891- date 13 06 2022'
Explanation:
(?<=\b\d) # Assert that a single digit precedes the current position
\s+ # Match one (or more) whitespace character(s)
(?=\d\b) # Assert that a single digit follows the current position
The sub() operation removes all whitespace that matches this rule.
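As a rough sketch, the lookaround approach can be wrapped in a small function and checked against the cases from the question. The names SINGLE_DIGIT_GAP and join_single_digits are hypothetical, chosen here for illustration only.

```python
import re

# Whitespace that has a standalone single digit on both sides
SINGLE_DIGIT_GAP = re.compile(r"(?<=\b\d)\s+(?=\d\b)")


def join_single_digits(text):
    """Hypothetical helper: remove whitespace between standalone single digits."""
    return SINGLE_DIGIT_GAP.sub("", text)


sample = "0 1 loren ipsum 1 2 3 dolor sit 4 5 6 amet -7 8 9 1- date 13 06 2022"
print(join_single_digits(sample))
# 01 loren ipsum 123 dolor sit 456 amet -7891- date 13 06 2022

# The date case from the question: only the two single digits are joined
print(join_single_digits("1 1 1970"))   # 11 1970

# Multi-digit tokens are left alone; only "5 6" qualifies
print(join_single_digits("12 34 5 6"))  # 12 34 56
```

Because the lookarounds consume no characters, every gap is tested against the original neighbours, which is why a run such as "1 2 3" collapses in a single pass.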