using regex to split TM symbol from string?

Time:11-29

As the title states, I am trying to use regex to split the trademark ™ symbol from a string. I am looking for two possible patterns:

  1. string™ --> expected result: string ™

or

  2. string™2 --> expected result: string ™ 2

I came up with the below pattern to check whether a string contains either potential option:

pattern = "[a-zA-Z0-9]+[™]([0-9])?$"

Is there any way to add some functionality to split it to end up with the expected results mentioned above?

CodePudding user response:

I'd do it with re.sub in two steps. First add a space to the left of the symbol where needed, then to the right of it:

import re

s = """\
string™
string™2
test test string™9 test test test"""


s = re.sub(r"([a-zA-Z0-9])™", r"\1 ™", s)
s = re.sub(r"™([0-9])", r"™ \1", s)

print(s)

Prints:

string ™
string ™ 2
test test string ™ 9 test test test
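The two substitutions above can probably be folded into a single pass by matching only the empty positions around the symbol. This is a minimal sketch, not the answerer's code: the names GAP and space_out_tm are made up for illustration, and it assumes the same character classes as the two-step version.

```python
import re

# Zero-width pattern: matches the empty position between a letter/digit and ™,
# or between ™ and a digit. Nothing is consumed, so sub() only inserts a space.
GAP = re.compile(r"(?<=[a-zA-Z0-9])(?=™)|(?<=™)(?=[0-9])")


def space_out_tm(text):
    """Insert a space at every position matched by GAP (hypothetical helper)."""
    return GAP.sub(" ", text)


print(space_out_tm("string™"))   # string ™
print(space_out_tm("string™2"))  # string ™ 2
print(space_out_tm("test test string™9 test test test"))
```

Because the lookarounds consume no characters, a string that already has spaces around ™ should come back unchanged.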

CodePudding user response:

re.split will do the job, given just the information in the question.

The code below uses a list comprehension to drop the None entries that re.split produces when it splits on a space, as well as the empty string left over when ™ is at the end of the string:

import re

trials = 'string™', 'string™2', 'test test string™9 test test test'

for trial in trials:
    result = [x for x in re.split(' |(™)', trial) if x]
    print(f'{result!r} {" ".join(result)!r}')

Output:

['string', '™'] 'string ™'
['string', '™', '2'] 'string ™ 2'
['test', 'test', 'string', '™', '9', 'test', 'test', 'test'] 'test test string ™ 9 test test test'
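To see why the filtering step is needed, it may help to look at what re.split returns before the list comprehension runs. This is a small sketch using the same pattern as above; the helper name split_tm and the sample input are made up for illustration.

```python
import re


def split_tm(text):
    """Return the raw re.split result and the filtered one (hypothetical helper)."""
    # The capture group makes re.split keep ™ in the output. When the split
    # happens on a space instead, the group did not take part in the match,
    # so a None is placed in the list.
    raw = re.split(r" |(™)", text)
    cleaned = [part for part in raw if part]  # drops both None and ''
    return raw, cleaned


raw, cleaned = split_tm("a string™")
print(raw)      # ['a', None, 'string', '™', '']
print(cleaned)  # ['a', 'string', '™']
```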

CodePudding user response:

I would simply add one space before and one after the character, then strip any trailing space that leaves at the end of the string.

import re

text = ("string™", "string™2", "test test string™9 test test test")
pattern = re.compile(r"([^ ])(™)(.*)")

for t in text:
    print(re.sub(pattern, r"\1 \2 \3", t).rstrip())

# Outputs:
# --------
# string ™
# string ™ 2
# test test string ™ 9 test test test

If you only want to capture digits after the trademark symbol, replace the dot in the pattern with [0-9].

But honestly, why use regex for this task at all? A simple string replacement is sufficient:

for t in text:
    print(t.replace("™", " ™ ").rstrip())

Fewer (or no) dependencies, better readability, better testability, and easier maintenance, IMHO.
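One caveat with the plain replacement: if the input already has spaces around ™, padding it again leaves doubled spaces. A possible variant, sketched here under the assumption that collapsing every run of whitespace (newlines and tabs included) is acceptable, rebuilds the string with single spaces. The function name space_out_tm is made up for illustration.

```python
def space_out_tm(text):
    """Pad ™ with spaces and normalise whitespace (hypothetical helper)."""
    # Pad every ™ with spaces, then rebuild the string with single spaces.
    # str.split() with no arguments splits on any run of whitespace and
    # ignores leading/trailing whitespace, so newlines and tabs collapse too.
    return " ".join(text.replace("™", " ™ ").split())


for t in ("string™", "string ™ 2", "test test string™9 test test test"):
    print(space_out_tm(t))
```

This should not be used on multi-line text where line breaks matter, since they are turned into single spaces.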
