current code:
import re
txt = "Jeor MORMONT, Lord COMMANDER of the NIGHT'S WATCH."
print(re.findall(r"\w+|\W+", txt))
output:
['Jeor', ' ', 'MORMONT', ', ', 'Lord', ' ', 'COMMANDER', ' ', 'of', ' ', 'the', ' ', 'NIGHT', "'", 'S', ' ', 'WATCH', '.']
desired output:
['Jeor', ' ', 'MORMONT', ', ', 'Lord', ' ', 'COMMANDER', ' ', 'of', ' ', 'the', ' ', "NIGHT'S", ' ', 'WATCH', '.']
CodePudding user response:
Try this:
txt = "Jeor MORMONT, Lord COMMANDER of the NIGHT'S WATCH."
print(re.findall(r"[\w']+|\W+", txt))
CodePudding user response:
You need to use a character set. You can accomplish this by using brackets [ ]. When using a character set, any one of the characters in the set will be matched.
As you want either a word character or ', you should use:
[\w']+|\W+
[ ]: A character set; matches any one of the characters listed inside it.
\w: A word character (for ASCII text, the same as [a-zA-Z0-9_]).
': The apostrophe itself; there is no need to escape it.
+: One or more repetitions of the preceding set.
print(re.findall(r"[\w']+|\W+", txt))
# ['Jeor', ' ', 'MORMONT', ', ', 'Lord', ' ', 'COMMANDER', ' ', 'of', ' ', 'the', ' ', "NIGHT'S", ' ', 'WATCH', '.']
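As a quick check of the difference, here is a minimal sketch that runs the question's pattern and the character-set pattern on the same string. The helper name `tokenize` is hypothetical and only there for illustration. The last line checks that joining the tokens gives back the input, which should hold here because every character is either in [\w'] or in \W.

```python
import re

txt = "Jeor MORMONT, Lord COMMANDER of the NIGHT'S WATCH."

def tokenize(pattern, text):
    # Hypothetical helper: return all non-overlapping matches, in order.
    return re.findall(pattern, text)

original = tokenize(r"\w+|\W+", txt)    # pattern from the question
fixed = tokenize(r"[\w']+|\W+", txt)    # apostrophe added to the word set

print(original[12:15])        # ['NIGHT', "'", 'S']
print(fixed[12])              # NIGHT'S
print("".join(fixed) == txt)  # True
```

Note that this pattern treats any apostrophe as part of a word, so a quoted word such as 'WATCH' would keep its surrounding quotes in the token.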
CodePudding user response:
You just need to explore regex a bit more
>>> print(re.findall(r"[a-zA-Z\']+", txt))
['Jeor', 'MORMONT', 'Lord', 'COMMANDER', 'of', 'the', "NIGHT'S", 'WATCH']
>>>
Update:
>>> import re
>>>
>>> txt = "Jeor MORMONT, Lord COMMANDER of the NIGHT'S WATCH."
>>>
>>> required = ['Jeor', ' ', 'MORMONT', ', ', 'Lord', ' ', 'COMMANDER', ' ', 'of', ' ', 'the', ' ', 'NIGHT\'S', ' ', 'WATCH', '.']
>>>
>>> bag = re.findall(r'[a-zA-Z\']+|[\ ,]+|[\.]', txt)
>>>
>>> print(bag)
['Jeor', ' ', 'MORMONT', ', ', 'Lord', ' ', 'COMMANDER', ' ', 'of', ' ', 'the', ' ', "NIGHT'S", ' ', 'WATCH', '.']
>>> print(bag == required)
True
>>>
Comment here if I missed something.
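One thing to keep in mind with explicitly listed character classes: re.findall silently skips any character that none of the alternatives match. A small sketch of that behaviour follows; the sample string is made up for illustration and is not from the question.

```python
import re

# Made-up sample containing digits and '!', which the explicit classes do not list.
sample = "Room 101, WATCH!"

explicit = re.findall(r"[a-zA-Z']+|[ ,]+|[.]", sample)
general = re.findall(r"[\w']+|\W+", sample)

print(explicit)  # ['Room', ' ', ', ', 'WATCH']
print(general)   # ['Room', ' ', '101', ', ', 'WATCH', '!']
```

If the input can contain digits or other punctuation, the \w / \W version is probably the safer choice, since together they cover every character.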