How can I modify this regex pattern to also remove spaces after a newline \n?

I have a string with excess whitespace. I want to remove any whitespace at the start of each line, up to the color name. I also want to preserve single spaces between words, leave colons alone when they don't precede a percentage (see the Pastels line in the string for an example), and keep the number of spaces after the colon (1 space for double digits, 2 spaces for single digits). So far I'm preserving everything I want, but I can't get rid of a single space after the \n.

How do I remove all whitespace after a new line and at the start of the string in one pattern?

I want the string to look like this: 'Red: 80%\nNavy Blue: 15%\nGreen:  3%\nPastels: Pink, Baby Blue, Lavender:  2%'

import re

my_string = '    Red: 80%\n Navy Blue: 15%\n  Green:  3%\n   Pastels: Pink, Baby Blue, Lavender:  2%'

my_pattern = re.compile('(?<![:])[ ]{2,}')    # match 2 or more spaces unless they follow a colon

# the following:
re.sub(my_pattern, '', my_string)
# returns this:
'Red: 80%\n Navy Blue: 15%\nGreen:  3%\nPastels: Pink, Baby Blue, Lavender:  2%'    # Note the number of spaces after the colons and newlines. 
                                                                                    # The space before "Navy Blue" is the problem.

# this would give me the desired result, but what pattern would let me do it all within one re.sub() ?
re.sub(my_pattern, '', my_string).replace('\n ', '\n')
# returns this:
'Red: 80%\nNavy Blue: 15%\nGreen:  3%\nPastels: Pink, Baby Blue, Lavender:  2%'

CodePudding user response：

Found a solution. Far simpler than I was originally thinking:

my_pattern = re.compile(r'(?m)^\s+')   # (?m) sets multiline mode
                                       # ^\s+ matches one or more whitespace chars at the start of a line

# a slightly cleaner way of writing the same thing:
my_pattern = re.compile(r'^\s+', re.MULTILINE)

# the following:
re.sub(my_pattern, '', my_string)
# returns:
'Red: 80%\nNavy Blue: 15%\nGreen:  3%\nPastels: Pink, Baby Blue, Lavender:  2%'
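
For reference, here is the fix above as a self-contained script, a minimal sketch run against the string from the question. The name `cleaned` is introduced here only for illustration.

```python
import re

my_string = '    Red: 80%\n Navy Blue: 15%\n  Green:  3%\n   Pastels: Pink, Baby Blue, Lavender:  2%'

# In multiline mode ^ matches at the start of every line, so \s+ only
# consumes the whitespace run that begins a line. Spaces after a colon
# are never at the start of a line, so they are left untouched.
my_pattern = re.compile(r'^\s+', re.MULTILINE)

cleaned = my_pattern.sub('', my_string)
print(repr(cleaned))
```

Because the pattern is anchored with ^, it no longer needs the colon lookbehind or the "2 or more" quantifier from the original attempt.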

CodePudding user response：

In order to remove only horizontal whitespace chars from the start of each line, you can use

# any one of these three equivalent forms:
my_pattern = re.compile(r'(?m)^[^\S\r\n]+')
my_pattern = re.compile(r'^[^\S\r\n]+', re.M)
my_pattern = re.compile(r'^[^\S\r\n]+', re.MULTILINE)
# and then use my_pattern.sub:
text = my_pattern.sub('', text)

Note that the (?m) inline modifier flag is equivalent to the re.M option. It is handy when you pass a regex to a function/method defined in some other library and do not want to import the re module just to be able to use the flag.
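
As a quick check of that equivalence, the sketch below applies the same pattern with the inline flag, with the flags argument, and with no flag at all. The sample text and the variable names are made up for illustration.

```python
import re

text = '  a\n  b'  # hypothetical sample input

# Inline flag: the pattern string alone carries multiline mode.
inline = re.sub(r'(?m)^[^\S\r\n]+', '', text)

# Same pattern with the flag passed as an argument instead.
flagged = re.sub(r'^[^\S\r\n]+', '', text, flags=re.M)

# With neither, ^ only matches at the very start of the string,
# so the indentation of the second line survives.
no_flag = re.sub(r'^[^\S\r\n]+', '', text)

print(repr(inline), repr(flagged), repr(no_flag))
```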

Details:

^ - start of a line (because of the multiline flag)
[^\S\r\n]+ - one or more (+) occurrences of any char but ([^...] is a negated character class) a CR (carriage return, \r), an LF (line feed, \n) or a non-whitespace char (\S). So, this is the same as \s with the LF and CR chars subtracted from it.
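
The practical difference from ^\s+ shows up when the input contains empty lines. A small sketch, using a made-up sample string and illustrative variable names:

```python
import re

# Hypothetical sample: an indented line, an empty line, another indented line.
text = '  first\n\n  second'

# \s also matches the line break, so the empty line is swallowed
# together with the indentation that follows it.
any_ws = re.sub(r'^\s+', '', text, flags=re.M)

# [^\S\r\n] excludes CR and LF, so only the indentation is removed
# and the empty line is kept.
horizontal = re.sub(r'^[^\S\r\n]+', '', text, flags=re.M)

print(repr(any_ws))
print(repr(horizontal))
```

For the string in the question there are no empty lines, so both patterns give the same result; the horizontal-only version matters when line structure has to be preserved.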
