I would like to add a space between runs of different characters. The pattern I have now adds one between every pair of characters, whereas I would like to skip positions where the neighboring characters are the same (consecutive repeats).
import re
string = "CEEEETTEEEEGGGCCBCTTBHHHHHCCEEEEEEEEETTEETT"
space = re.compile(r"(?<!^)(?=[CEHBGITS])(?!$)")
print(space.sub("\1", string))
Expected result is: C EEEE TT EEEE GGG CC B C TT B HHHHH CC EEEEEEEEE TT EE TT
CodePudding user response:
You can use
re.sub(r'(?<=(.))(?!\1)(?=.)', ' ', string)
The regular expression has the following elements.
(?<=(.)) # positive lookbehind asserts that the current string location
# is preceded by a character that is saved to capture group 1
(?!\1) # negative lookahead asserts that the current string
# location is not followed by the content of capture group 1
(?=.) # positive lookahead asserts that current string location
# is followed by a character
The current string location can be thought of as a position between consecutive characters.
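As a quick check, the pattern can be run against the string from the question. This is only a minimal sketch; the function name space_between_runs is made up for illustration and is not part of the answer.

```python
import re

def space_between_runs(text):
    # Insert a space at every position where the preceding character
    # (captured in group 1) differs from the following one, and a
    # following character exists. Nothing is consumed, only inserted.
    return re.sub(r'(?<=(.))(?!\1)(?=.)', ' ', text)

string = "CEEEETTEEEEGGGCCBCTTBHHHHHCCEEEEEEEEETTEETT"
result = space_between_runs(string)
print(result)
```

Since every match is zero-width, removing the spaces from the result gives back the original string.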
CodePudding user response:
Another option is to match the consecutive characters using a capture group and a backreference, and in the replacement use the full match, denoted by \g<0>, preceded by a space.
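To prevent a leading space, you can assert that there is at least one preceding character. Note that this assumes the string does not begin with a repeated character: for an input such as CCEE, the first match starts at the second C and the result is C C EE.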
(?<=[CEHBGITS])([CEHBGITS])\1*
In parts, the pattern matches:
(?<=           # Positive lookbehind, assert that what is directly to the left
               # of the current position is
[CEHBGITS]     # Match 1 of the listed characters
)              # Close lookbehind
([CEHBGITS])   # Capture group 1, match 1 of the listed characters
\1*            # Optionally repeat what is captured in group 1 using a backreference
For example
import re
s = "CEEEETTEEEEGGGCCBCTTBHHHHHCCEEEEEEEEETTEETT"
pattern = r"(?<=[CEHBGITS])([CEHBGITS])\1*"
print(re.sub(pattern, r" \g<0>", s))
Output
C EEEE TT EEEE GGG CC B C TT B HHHHH CC EEEEEEEEE TT EE TT
A shorter version matching non-whitespace characters:
(?<=\S)(\S)\1*
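A small sketch of how the shorter pattern behaves. The function names and sample strings are made up for illustration. The second function is an alternative of my own, not taken from the answers: it matches each run that is followed by a different non-whitespace character and appends the space after it, which may be preferable when the input can start with a repeated character.

```python
import re

def space_runs(text):
    # Shorter pattern from above: prefix each run of identical
    # non-whitespace characters with a space when a non-whitespace
    # character precedes it.
    return re.sub(r"(?<=\S)(\S)\1*", r" \g<0>", text)

def space_runs_variant(text):
    # Hypothetical alternative: match a whole run, require that the next
    # character is a different non-whitespace character, and append the
    # space after the run instead of before it.
    return re.sub(r"(\S)\1*(?!\1)(?=\S)", r"\g<0> ", text)

print(space_runs("abbc"))            # a bb c
print(space_runs("aab"))             # a a b  (leading run is split)
print(space_runs_variant("aab"))     # aa b
print(space_runs_variant("aabbbc"))  # aa bbb c
```

The lookbehind version starts matching at the second character, so a leading run such as aa is split after its first character; the variant avoids that by anchoring on the end of each run rather than its start.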