I am trying to add spaces between the characters of acronyms only (words written entirely in capital letters) in Python.
INPUT:
"The PNUD, UN, UCALP and USA and U N."
DESIRED OUTPUT:
"The P N U D, U N, U C A L P and U S A and U N."
I have this solution so far, but I am looking for something more efficient/elegant:
import re
data = "The PNUD, UN, UCALP and USA and U N."
result = re.sub(r'(?=(?!^)[^[a-z]|\s+|\W]*)', ' ', data)
result = re.sub(r'\s+(\W)', r'\g<1>', result)
print(result)
CodePudding user response:
I think the following regex is a much simpler solution to this problem:
re.sub('([A-Z])(?=[A-Z])', '\\1 ', s)
I'm just using a positive lookahead and a backreference.
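As a rough sketch of how that behaves on the question's input (the variable names are only for illustration): the group captures one uppercase letter, and the lookahead checks that the next character is also uppercase without consuming it, so each letter of a run can match in turn.

```python
import re

data = "The PNUD, UN, UCALP and USA and U N."

# ([A-Z])   captures a single uppercase letter
# (?=[A-Z]) only asserts that another uppercase letter follows; it is not consumed
# r"\1 "    puts the captured letter back, followed by a space
result = re.sub(r"([A-Z])(?=[A-Z])", r"\1 ", data)
print(result)  # The P N U D, U N, U C A L P and U S A and U N.
```

The last letter of each run is left alone because nothing uppercase follows it, which is why no space appears before the commas or the final period.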
CodePudding user response:
Another solution, using re.sub with a lambda function:
import re
data = "The PNUD, UN, UCALP and USA and U N."
result = re.sub(r"\b[A-Z]+\b", lambda g: " ".join(g.group(0)), data)
print(result)
Prints:
The P N U D, U N, U C A L P and U S A and U N.
EDIT: Small benchmark
import re
from timeit import timeit
pat1 = re.compile(r"\b[A-Z]+\b")
pat2 = re.compile(r"([A-Z])(?=[A-Z])")
pat3 = re.compile(r"[A-Z](?=[A-Z])") # the same without capturing group
data = "The PNUD, UN, UCALP and USA and U N."
def fn1():
    return pat1.sub(lambda g: " ".join(g.group(0)), data)
def fn2():
    return pat2.sub(r"\g<1> ", data)
def fn3():
    return pat3.sub(r"\g<0> ", data)
t1 = timeit(fn1, number=10_000)
t2 = timeit(fn2, number=10_000)
t3 = timeit(fn3, number=10_000)
print(t1)
print(t2)
print(t3)
Prints:
0.05032820999622345
0.10462480317801237
0.10249458998441696
CodePudding user response:
You can use a single call to re.sub: match a single uppercase character and assert that another one follows it.
In the replacement, use the full match followed by a space, via \g<0>:
[A-Z](?=[A-Z])
Example
result = re.sub('[A-Z](?=[A-Z])', r'\g<0> ', data)
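One thing that may be worth checking against your own data: the lookahead pattern splits any two adjacent capitals, even inside a mixed-case word, whereas a word-boundary pattern only rewrites words made up entirely of capitals. A small sketch of the difference, using hypothetical helper names and a made-up sample string:

```python
import re

# Hypothetical helper names, for illustration only.
def space_adjacent_caps(text):
    # Add a space after every capital that is directly followed by another capital.
    return re.sub(r"[A-Z](?=[A-Z])", r"\g<0> ", text)

def space_whole_acronyms(text):
    # Only rewrite words that consist entirely of capitals.
    return re.sub(r"\b[A-Z]+\b", lambda m: " ".join(m.group(0)), text)

sample = "USAid and USA"  # made-up input with a mixed-case word
print(space_adjacent_caps(sample))   # U S Aid and U S A
print(space_whole_acronyms(sample))  # USAid and U S A
```

If the text never contains mixed-case words like that, both give the same result and the choice comes down to taste.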