Regex for inserting space after different characters

I would like to add a space between runs of different characters. The pattern I have now adds one between every pair of characters, whereas I want to skip it when consecutive characters are the same.

import re
string = "CEEEETTEEEEGGGCCBCTTBHHHHHCCEEEEEEEEETTEETT"
space = re.compile(r"(?<!^)(?=[CEHBGITS])(?!$)")
print(space.sub("\1", string))

Expected result is: C EEEE TT EEEE GGG CC B C TT B HHHHH CC EEEEEEEEE TT EE TT

CodePudding user response：

You can use

re.sub(r'(?<=(.))(?!\1)(?=.)', ' ', string)

See the Python demo and the regex demo.

The regular expression has the following elements.

(?<=(.)) # positive lookbehind asserts that the current string location
         # is preceded by a character that is saved to capture group 1
(?!\1)   # negative lookahead asserts that the current string
         # location is not followed by the content of capture group 1
(?=.)    # positive lookahead asserts that current string location
         # is followed by a character

The current string location can be thought of as a position between consecutive characters.
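As a minimal sketch of how this pattern behaves, the snippet below wraps it in a helper and runs it on a short input and on the string from the question. The names BOUNDARY and space_runs are made up for illustration and are not part of the answer above.

```python
import re

# Zero-width position where: a character precedes it (captured as group 1),
# the next character is different from it, and the string has not ended.
BOUNDARY = re.compile(r"(?<=(.))(?!\1)(?=.)")

def space_runs(text):
    """Insert one space at each boundary between runs of identical characters."""
    return BOUNDARY.sub(" ", text)

print(space_runs("CEEEETT"))  # C EEEE TT

string = "CEEEETTEEEEGGGCCBCTTBHHHHHCCEEEEEEEEETTEETT"
print(space_runs(string))
```

Because the pattern only matches positions and never consumes characters, removing the inserted spaces gives back the original string.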

CodePudding user response：

Another option is to match each run of consecutive identical characters using a capture group and a backreference, and in the replacement use the full match, denoted by \g<0>, preceded by a space.

To prevent a leading space, you can assert that there is at least one preceding character.

(?<=[CEHBGITS])([CEHBGITS])\1*

In parts, the pattern matches:

(?<= Positive lookbehind, assert that what is directly to the left of the current position is
  [CEHBGITS] Match one of the listed characters
) Close lookbehind
([CEHBGITS]) Capture group 1, match one of the listed characters
\1* Optionally repeat what is captured in group 1 using a backreference

See a regex demo and a Python demo.

For example

import re
s = "CEEEETTEEEEGGGCCBCTTBHHHHHCCEEEEEEEEETTEETT"
pattern = r"(?<=[CEHBGITS])([CEHBGITS])\1*"
print(re.sub(pattern, r" \g<0>", s))

Output

C EEEE TT EEEE GGG CC B C TT B HHHHH CC EEEEEEEEE TT EE TT
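One edge case may be worth checking before relying on this pattern. The lookbehind can only succeed from the second character onward, so when the input starts with a run longer than one character, the first match seems to begin inside that run and splits it. The sketch below compares this pattern with the lookaround pattern from the first answer on a hypothetical input "CCEE". The variable names are chosen for illustration only.

```python
import re

lookaround = r"(?<=(.))(?!\1)(?=.)"            # pattern from the first answer
run_match = r"(?<=[CEHBGITS])([CEHBGITS])\1*"  # pattern from this answer

s = "CCEE"  # hypothetical input whose first run has two characters

print(re.sub(lookaround, " ", s))       # CC EE
print(re.sub(run_match, r" \g<0>", s))  # C C EE
```

The string in the question starts with a single C, which is why the output above is as expected. For inputs that may start with a longer run, the lookaround version looks like the safer choice.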

A shorter version matching non-whitespace characters:

(?<=\S)(\S)\1*

Regex demo
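A small sketch of the shorter pattern on characters outside the original set. The input "xyyy112z" is made up for illustration. It contains no whitespace and its first run is a single character, because the pattern needs a non-whitespace character in front of a run in order to match that run as a whole.

```python
import re

# Same idea as the longer pattern, but any non-whitespace character qualifies.
short = re.compile(r"(?<=\S)(\S)\1*")

s = "xyyy112z"  # hypothetical input
print(short.sub(r" \g<0>", s))  # x yyy 11 2 z
```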