As the title states, I am trying to use regex to split the trademark ™ symbol from a string. I am looking for two possible patterns:
- string™ --> expected result: string ™
or
- string™2 --> expected result: string ™ 2
I came up with the below pattern to check whether a string contains either potential option:
pattern = "[a-zA-Z0-9]+[™]([0-9])?$"
Is there any way to add some functionality to split it to end up with the expected results mentioned above?
CodePudding user response:
I'd do it with re.sub in two steps. First add a space on the left side of the ™ where necessary, then on the right side:
import re
s = """\
string™
string™2
test test string™9 test test test"""
s = re.sub(r"([a-zA-Z0-9])™", r"\1 ™", s)
s = re.sub(r"™([0-9])", r"™ \1", s)
print(s)
Prints:
string ™
string ™ 2
test test string ™ 9 test test test
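The two substitutions can probably be folded into a single pass with zero-width lookarounds. This is only a sketch, not part of the answer above: the helper name space_trademark and the constant BOUNDARY are made up for illustration, and it assumes the same character classes as the two patterns above.

```python
import re

# Hypothetical helper (not from the answer above): match the empty position
# between an alphanumeric character and ™, or between ™ and a digit.
BOUNDARY = re.compile(r"(?<=[a-zA-Z0-9])(?=™)|(?<=™)(?=[0-9])")

def space_trademark(text):
    # Both alternatives are zero-width, so no character is consumed
    # and no backreference is needed in the replacement.
    return BOUNDARY.sub(" ", text)

for sample in ("string™", "string™2", "test test string™9 test test test"):
    print(space_trademark(sample))
```

Because only the boundaries are matched, input that already has spaces around the ™ should pass through unchanged.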
CodePudding user response:
re.split will do the job; just give us all the information in the question.
The code below uses a list comprehension to drop the None entries produced when the split happens on a space, and the extra empty string produced when ™ is at the end of the string:
import re
trials = 'string™', 'string™2', 'test test string™9 test test test'
for trial in trials:
    result = [x for x in re.split(' |(™)', trial) if x]
    print(f'{result!r} {" ".join(result)!r}')
Output:
['string', '™'] 'string ™'
['string', '™', '2'] 'string ™ 2'
['test', 'test', 'string', '™', '9', 'test', 'test', 'test'] 'test test string ™ 9 test test test'
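To see why the filter is needed, it may help to look at the unfiltered split. A small sketch with a made-up sample string: when the space alternative matches, the capturing group did not take part in the match and shows up as None, and a ™ at the very end leaves a trailing empty string.

```python
import re

# Unfiltered result: contains a None (from the split on the space)
# and a trailing '' (from the ™ at the end of the string).
raw = re.split(' |(™)', 'test string™')
print(raw)

# Filtering on truthiness drops both kinds of unwanted entries.
cleaned = [x for x in raw if x]
print(cleaned)
```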
CodePudding user response:
I would simply add one space before and after the ™ character and then strip any trailing space that this leaves at the end.
import re
text = ("string™", "string™2", "test test string™9 test test test")
pattern = re.compile(r"([^ ])(™)(.*)")
for t in text:
    print(re.sub(pattern, r"\1 \2 \3", t).rstrip())
# Outputs:
# --------
# string ™
# string ™ 2
# test test string ™ 9 test test test
If you're really only looking for digits after the trademark symbol, simply replace the dot with [0-9].
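A sketch of that digit-only variant, assuming the dot is swapped for [0-9] and the rest of the pattern stays the same. The names digits_pattern and space_trademark_digits are made up for illustration.

```python
import re

# Same pattern as above with the dot replaced by [0-9] (assumed reading
# of the suggestion): group 3 now captures only the digits after the ™.
digits_pattern = re.compile(r"([^ ])(™)([0-9]*)")

def space_trademark_digits(t):
    # When no digit follows, group 3 is empty and the replacement leaves
    # a trailing space, so rstrip() is still needed.
    return digits_pattern.sub(r"\1 \2 \3", t).rstrip()

for t in ("string™", "string™2", "a™1 b™2"):
    print(space_trademark_digits(t))
```

One side effect worth noting: since group 3 no longer swallows the rest of the line, a second ™ on the same line also gets matched, which the (.*) version would skip.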
But honestly, why use regex for this task at all? A simple string replacement is also sufficient:
for t in text:
    print(t.replace("™", " ™ ").rstrip())
Fewer or no dependencies, better readability, better testability, better maintainability, IMHO.
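One caveat with plain replacement: input that already has spaces around the ™ ends up with doubled spaces. A possible workaround is to collapse whitespace afterwards. This is a sketch with a hypothetical helper name, and it assumes that collapsing all whitespace (tabs and newlines included) into single spaces is acceptable for your data.

```python
def space_trademark_plain(t):
    # Pad every ™ with spaces, then split on any whitespace and re-join
    # with single spaces. This also removes leading/trailing whitespace.
    # Assumption: normalizing all whitespace in the string is acceptable.
    return " ".join(t.replace("™", " ™ ").split())

for t in ("string™", "string™2", "string ™ 2"):
    print(space_trademark_plain(t))
```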