In Python, within a string, remove multiple spaces with a single space, if it's not leading mul-CodePudding

I have various strings that might have multiple leading spaces.

string_1 = '    param A     val A'
string_2 = 'param B    val B'

....

I want to replace all multiple spaces with a single space IF the multiple spaces are not in the start of the string.

I want output of above to become

 string_1 = '    param A val A'z
 string_2 = 'param B val B'

My current solution replaces all multiple spaces with a single space regardless.

 re.sub('\s ',' ',s)

How would I construct a pattern that only captures non leading multiple spaces?

CodePudding user response：

You can use \b\s{2,}\b as your pattern, If those multiple spaces are leading, they are not in a word boundary. Also for multiple spaces use {2,} instead of to exclude single space:

import re


string_1 = "    param A     val A"
string_2 = "param B    val B"


pattern = re.compile(r"\b\s{2,}\b")
for test in (string_1, string_2):
    print(pattern.sub(" ", test))

output:

    param A val A
param B val B

Note: Trailing multiple spaces is not changed this way. To do that you can omit last \b then if converts to a single space again.

As noted by @JvdV, \b doesn't take other range of characters into account. For example if you a string like "[ param A val A ]", the above pattern won't work for it. Instead you can use a Positive Look Behind assertion ((?<=\S)) and Positive Look Ahead assertion ((?=\S)) to match any non-white-space character:

>>> import re
>>> text = "[    param A     val A    ]"
>>> re.sub(r"\b\s{2,}\b", " ", text)
'[    param A val A    ]'
>>> re.sub(r"(?<=\S)\s{2,}(?=\S)", " ", text)
'[ param A val A ]'

CodePudding user response：

Have a go with:

(?<=\S)\s (\s\S|$)

And replace with \1. See an online demo

(?<=\S) - Assert position is preceded by a non-whitespace char;
\s - 1 Whitespace char (greedy);
(\s\S|$) - Capture group to match a whitespace char and a non-whitespace char or a end-string anchor.

The above will leave only leading spaces alone. For example:

import regex as re
s = '    Abc   12$   D   Efg    '
print(re.sub(r'(?<=\S)\s (\s\S|$)', '\1', s))

Prints: Abc 12$ D Efg

CodePudding user response：

split your input into two groups (1st group: leading space, 2nd group: everything else), then eliminate the multispaces from 2nd group, and join groups:

>>> string_1 = '    param A     val A'
>>> sb = re.compile(r'^(\s )(.*)')
>>> mg = sb.match(string_1).groups()
>>> mg[0]   ' '.join(mg[1].split())
'    param A val A'