I need a regex function that recognizes a hashtag in a sentence, splits the words and numbers inside the hashtag, and inserts the word 'hashtag' right after the # sign. For example:
- Input:
#MainauDeclaration2015 watch out guys.. This is HUGE!! #LindauNobel #SemST
- Output:
#hashtag Mainau Declaration 2015 watch out guys.. This is HUGE!! #hashtag Lindau Nobel #hashtag Sem S T
As you can see, the words need to be split before every capital letter and every number. However, a number such as 2015 must stay whole and must not become 2 0 1 5.
I already have the following:
document = re.sub(r"(#)([A-Za-z]*|\d*)", r" \1hashtag \2 ", document)
With output: #hashtag MainauDeclaration 2015 watch out guys.. This is HUGE!! #hashtag LindauNobel #hashtag SemST
CodePudding user response:
You can use
import re
text = "#MainauDeclaration2015 watch out guys.. This is HUGE!! #LindauNobel #SemST"
print(re.sub(r'#(\w+)', lambda x: '#hashtag ' + re.sub(r'(?!^)(?=[A-Z])|(?<=\D)(?=\d)|(?<=\d)(?=\D)', ' ', x.group(1)), text))
# => #hashtag Mainau Declaration 2015 watch out guys.. This is HUGE!! #hashtag Lindau Nobel #hashtag Sem S T
See the Python demo.
The #(\w+) regex used with the first re.sub matches a # followed by one or more word chars, which are captured into Group 1.
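To see what the outer pattern `#(\w+)` captures before any splitting happens, the Group 1 values can be listed on their own with re.findall. This is only a small sketch that reuses the sample sentence from the question:

```python
import re

text = "#MainauDeclaration2015 watch out guys.. This is HUGE!! #LindauNobel #SemST"

# With a single capturing group, findall returns just the group contents,
# i.e. the hashtag body without the leading '#'.
tags = re.findall(r'#(\w+)', text)
print(tags)
# => ['MainauDeclaration2015', 'LindauNobel', 'SemST']
```

Note that \w also matches digits and underscores, so the trailing 2015 is part of the captured value.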
The re.sub(r'(?!^)(?=[A-Z])|(?<=\D)(?=\d)|(?<=\d)(?=\D)', ' ', x.group(1)) part takes the Group 1 value as input and inserts a space between a non-digit and a digit, between a digit and a non-digit, and before a non-initial uppercase letter.
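The inner substitution can also be tried in isolation to see where the spaces land. In the sketch below, the names BOUNDARY and split_tag are made up for illustration and are not part of the code above:

```python
import re

# Zero-width positions: before a non-initial capital letter,
# or at a boundary between a digit and a non-digit (either direction).
BOUNDARY = re.compile(r'(?!^)(?=[A-Z])|(?<=\D)(?=\d)|(?<=\d)(?=\D)')

def split_tag(tag):
    """Insert a space at every boundary position (hypothetical helper)."""
    return BOUNDARY.sub(' ', tag)

for tag in ["MainauDeclaration2015", "LindauNobel", "SemST"]:
    print(tag, '->', split_tag(tag))
# MainauDeclaration2015 -> Mainau Declaration 2015
# LindauNobel -> Lindau Nobel
# SemST -> Sem S T
```

Because every alternative is a lookaround, the matches are empty and no character of the hashtag is consumed. Consecutive digits never form a boundary, which is why 2015 stays in one piece.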