How to split a text in words, retain punctuation marks but without symbol: " ' "

Time:03-29

current code:

import re

txt = "Jeor MORMONT, Lord COMMANDER of the NIGHT'S WATCH."
print(re.findall(r"\w+|\W+", txt))

output:

['Jeor', ' ', 'MORMONT', ', ', 'Lord', ' ', 'COMMANDER', ' ', 'of', ' ', 'the', ' ', 'NIGHT', "'", 'S', ' ', 'WATCH', '.']

desired output:

['Jeor', ' ', 'MORMONT', ', ', 'Lord', ' ', 'COMMANDER', ' ', 'of', ' ', 'the', ' ', "NIGHT'S", ' ', 'WATCH', '.']

CodePudding user response:

Try this:

import re

txt = "Jeor MORMONT, Lord COMMANDER of the NIGHT'S WATCH."
print(re.findall(r"[\w']+|\W+", txt))

CodePudding user response:

You need to use a character set.

You can accomplish this by using brackets [ ]. When using a character set, one of the characters in the set will be matched.
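
As a quick illustration of that rule (the sample string is made up for this example, not taken from the question), a set matches exactly one character per position unless a quantifier such as + follows it:

```python
import re

sample = "it's 42"  # made-up input for illustration

# Without a quantifier, the set [a-z'] matches one character at a time.
single = re.findall(r"[a-z']", sample)

# With +, consecutive characters from the set are grouped into one match.
grouped = re.findall(r"[a-z']+", sample)

print(single)   # ['i', 't', "'", 's']
print(grouped)  # ["it's"]
```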


As you want either a word character or ', you should use:

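[\w']+|\W+
  • [ ]: A character set; matches any one of the characters listed inside it.
    • \w: A word character (the same as [a-zA-Z0-9_] for ASCII text).
    • ': The apostrophe itself; there is no need to escape it inside the set.
  • +: One or more of the preceding item, so each run of such characters becomes a single match.

print(re.findall(r"[\w']+|\W+", txt))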
# ['Jeor', ' ', 'MORMONT', ', ', 'Lord', ' ', 'COMMANDER', ' ', 'of', ' ', 'the', ' ', "NIGHT'S", ' ', 'WATCH', '.']
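
If the pattern is needed in more than one place, it could be wrapped in a small helper. The sketch below is only an illustration: the name split_keep_punct is hypothetical, and adding the typographic apostrophe ’ (U+2019) to the set is an assumption about the input, not something the question asks for.

```python
import re

# Hypothetical helper; the name is illustrative only.
# Assumption: words may contain ' or the typographic apostrophe (U+2019).
TOKEN_RE = re.compile(r"[\w'\u2019]+|\W+")

def split_keep_punct(text):
    """Split text into runs of word characters/apostrophes and runs of everything else."""
    return TOKEN_RE.findall(text)

txt = "Jeor MORMONT, Lord COMMANDER of the NIGHT'S WATCH."
tokens = split_keep_punct(txt)
print(tokens)

# Every character is either \w or \W, so joining the tokens restores the input.
assert "".join(tokens) == txt
```

Compiling the pattern once mainly keeps it in a single place; the matching behaviour is the same as calling re.findall directly.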

CodePudding user response:

You just need to explore regex a bit more.

>>> print(re.findall(r"[a-zA-Z\']+", txt))
['Jeor', 'MORMONT', 'Lord', 'COMMANDER', 'of', 'the', "NIGHT'S", 'WATCH']
>>>

Update:

>>> import re
>>>
>>> txt = "Jeor MORMONT, Lord COMMANDER of the NIGHT'S WATCH."
>>>
>>> required = ['Jeor', ' ', 'MORMONT', ', ', 'Lord', ' ', 'COMMANDER', ' ', 'of', ' ', 'the', ' ', 'NIGHT\'S', ' ', 'WATCH', '.']
>>>
>>> bag = re.findall(r'[a-zA-Z\']+|[\ ,]+|[\.]', txt)
>>>
>>> print(bag)
['Jeor', ' ', 'MORMONT', ', ', 'Lord', ' ', 'COMMANDER', ' ', 'of', ' ', 'the', ' ', "NIGHT'S", ' ', 'WATCH', '.']
>>> print(bag == required)
True
>>>

Comment here if I missed something.
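
One thing to keep in mind with an explicit whitelist like the one in this answer: re.findall silently skips any character that none of the alternatives cover. A minimal check, using a made-up sentence that contains digits and a question mark, is to join the tokens back together and compare the result with the input:

```python
import re

patterns = {
    "whitelist": r"[a-zA-Z']+|[ ,]+|[.]",  # letters, spaces/commas, and periods only
    "general": r"[\w']+|\W+",              # character-set pattern from the earlier answer
}

sample = "Is it 300 miles to the WALL?"  # made-up input, not from the question

results = {name: re.findall(pattern, sample) for name, pattern in patterns.items()}

for name, tokens in results.items():
    # A lossless split reproduces the input exactly when the tokens are joined.
    lossless = "".join(tokens) == sample
    print(name, tokens, "lossless:", lossless)
```

Here the whitelist drops "300" and "?", while the general pattern keeps every character, so the whitelist is safe only when the input is known to contain nothing beyond letters, apostrophes, spaces, commas, and periods.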
