Using regex to split TM symbol from string?

As the title states, I am trying to use regex to split the trademark ™ symbol from a string. I am looking for two possible patterns:

string™ --> expected result: string ™

string™2 --> expected result: string ™ 2

I came up with the below pattern to check whether a string contains either potential option:

pattern = "[a-zA-Z0-9]+[™]([0-9])?$"

Is there a way to extend this so that the string is split into the expected results shown above?

CodePudding user response：

I'd do it with re.sub in two steps: first add a space on the left side of the ™ where necessary, then on the right side:

import re

s = """\
string™
string™2
test test string™9 test test test"""


s = re.sub(r"([a-zA-Z0-9])™", r"\1 ™", s)
s = re.sub(r"™([0-9])", r"™ \1", s)

print(s)

Prints:

string ™
string ™ 2
test test string ™ 9 test test test
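
The two substitutions can also be folded into a single pass using lookarounds. This is only a sketch of an alternative, not part of the answer above; the names BOUNDARY and space_tm_lookaround are made up for illustration. The pattern matches the empty position between an alphanumeric character and ™, or between ™ and a digit, and inserts a space there.

```python
import re

# Hypothetical name: matches zero-width positions only, either between
# an alphanumeric character and ™, or between ™ and a digit.
BOUNDARY = re.compile(r"(?<=[a-zA-Z0-9])(?=™)|(?<=™)(?=[0-9])")


def space_tm_lookaround(text):
    # Nothing is consumed by the match, so the substitution only inserts a space.
    return BOUNDARY.sub(" ", text)


for sample in ("string™", "string™2", "test test string™9 test test test"):
    print(space_tm_lookaround(sample))
```

Because the lookarounds consume no characters, no capture groups or backreferences are needed in the replacement.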

CodePudding user response：

re.split will do the job too, as long as the ™ is in a capturing group so that it is kept in the result.

The code below uses a list comprehension to remove the None entries produced when the split happens on a space, and the extra empty string produced when ™ is at the end of the string:

import re

trials = 'string™', 'string™2', 'test test string™9 test test test'

for trial in trials:
    result = [x for x in re.split(' |(™)', trial) if x]
    print(f'{result!r} {" ".join(result)!r}')

Output:

['string', '™'] 'string ™'
['string', '™', '2'] 'string ™ 2'
['test', 'test', 'string', '™', '9', 'test', 'test', 'test'] 'test test string ™ 9 test test test'
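
To see what the list comprehension is filtering out, it may help to look at the raw result of the same split. The sample string below is a hypothetical input chosen to hit both a space and a trailing ™.

```python
import re

# Hypothetical sample input with one space and a trailing ™.
sample = "a b™"

# The capturing group keeps a matched ™ in the result. When the split
# happens on a space instead, the group did not participate in the match,
# so re.split puts None in its place.
raw = re.split(" |(™)", sample)
print(raw)  # ['a', None, 'b', '™', '']

# Filtering on truthiness drops both the None entries and the empty string.
cleaned = [x for x in raw if x]
print(cleaned)  # ['a', 'b', '™']
```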

CodePudding user response：

I would simply add one space before and one after the ™ character, then strip any trailing space that this leaves at the end.

import re

text = ("string™", "string™2", "test test string™9 test test test")
pattern = re.compile(r"([^ ])(™)(.*)")

for t in text:
    print(re.sub(pattern, r"\1 \2 \3", t).rstrip())

# Outputs:
# --------
# string ™
# string ™ 2
# test test string ™ 9 test test test

If you're really looking for numbers after the trademark symbol, simply replace the dot with [0-9].
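
A minimal sketch of that variant, assuming "replace the dot" means turning (.*) into ([0-9]*); the names digits_pattern and space_tm_digits are made up for illustration:

```python
import re

# Assumption: the third group now captures only the digits directly after ™.
digits_pattern = re.compile(r"([^ ])(™)([0-9]*)")


def space_tm_digits(text):
    # The third group may be empty, which leaves a trailing space
    # when ™ ends the string; rstrip() removes it.
    return digits_pattern.sub(r"\1 \2 \3", text).rstrip()


for t in ("string™", "string™2", "test test string™9 test test test"):
    print(space_tm_digits(t))
```

Note that a space is still inserted after ™ even when no digit follows, so a ™ that is already followed by a space in the middle of a string would end up with two.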

But honestly, why use regex for this task at all? A simple string replacement is sufficient:

for t in text:
    print(t.replace("™", " ™ ").rstrip())

Fewer/no dependencies, better readability, better testability, easier maintenance, IMHO.
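
One caveat with the plain replacement: if the input already has spaces around the ™, they get doubled. A possible workaround, sketched here under the assumption that collapsing every run of whitespace (including newlines) into a single space is acceptable, is to split and re-join afterwards. The name space_tm_replace is made up for illustration.

```python
def space_tm_replace(text):
    # Pad every ™ with spaces, then collapse all whitespace runs to one space.
    # Assumption: the input is a single line, since split() also swallows newlines.
    return " ".join(text.replace("™", " ™ ").split())


for t in ("string™", "string™2", "string ™ 2"):
    print(space_tm_replace(t))
```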