Regex function to split words and numbers in a hashtag in a sentence

Time:12-15

I need a regex function to recognize a hashtag in a sentence, split the words and numbers in the hashtag and put the word 'hashtag' behind the hashtag. For example:

  • Input: #MainauDeclaration2015 watch out guys.. This is HUGE!! #LindauNobel #SemST
  • Output: #hashtag Mainau Declaration 2015 watch out guys.. This is HUGE!! #hashtag Lindau Nobel #hashtag Sem S T

As you can see, the hashtag needs to be split before every capital letter and before every number. However, a number must stay whole: 2015 must not become 2 0 1 5.

I already have the following:

document = re.sub(r"(#)([A-Za-z]*|\d*)", r" \1hashtag \2 ", document)

With output: #hashtag MainauDeclaration 2015 watch out guys.. This is HUGE!! #hashtag LindauNobel #hashtag SemST.

CodePudding user response:

You can use

import re
text = "#MainauDeclaration2015 watch out guys.. This is HUGE!! #LindauNobel #SemST"
print(re.sub(r'#(\w+)', lambda x: '#hashtag ' + re.sub(r'(?!^)(?=[A-Z])|(?<=\D)(?=\d)|(?<=\d)(?=\D)', ' ', x.group(1)), text))
# => #hashtag Mainau Declaration 2015 watch out guys.. This is HUGE!! #hashtag Lindau Nobel #hashtag Sem S T

The #(\w+) regex used in the outer re.sub matches a # followed by one or more word characters, which are captured into Group 1.
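As a quick check of what the outer pattern captures, re.findall can be run on the sample sentence. This is only a minimal sketch; the variable name hashtag_bodies is illustrative and not part of the original answer.

```python
import re

text = "#MainauDeclaration2015 watch out guys.. This is HUGE!! #LindauNobel #SemST"

# With a single capturing group, findall returns only the Group 1 values,
# i.e. the hashtag text without the leading '#'.
hashtag_bodies = re.findall(r'#(\w+)', text)
print(hashtag_bodies)
# => ['MainauDeclaration2015', 'LindauNobel', 'SemST']
```

These Group 1 values are what the lambda receives through x.group(1).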

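The re.sub(r'(?!^)(?=[A-Z])|(?<=\D)(?=\d)|(?<=\d)(?=\D)', ' ', x.group(1)) part takes the Group 1 value as input and inserts a space at three kinds of positions: between a non-digit and a digit, between a digit and a non-digit, and before any uppercase letter that is not at the start of the string. No position between two digits ever matches, so 2015 stays in one piece.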
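To see the three alternatives separately, the inner pattern can be pulled out into a small helper. This is a sketch under the assumption that you want to test the splitting on its own; the names SPLIT_POINTS and split_hashtag_body are hypothetical and do not appear in the original answer.

```python
import re

# Each alternative is a zero-width position where a space gets inserted.
SPLIT_POINTS = re.compile(
    r'(?!^)(?=[A-Z])'    # before an uppercase letter, except at the very start
    r'|(?<=\D)(?=\d)'    # between a non-digit and a digit
    r'|(?<=\d)(?=\D)'    # between a digit and a non-digit
)

def split_hashtag_body(body):
    """Hypothetical helper: split the text after '#' into words and numbers."""
    return SPLIT_POINTS.sub(' ', body)

print(split_hashtag_body("MainauDeclaration2015"))  # => Mainau Declaration 2015
print(split_hashtag_body("SemST"))                  # => Sem S T
print(split_hashtag_body("Top10Lists"))             # => Top 10 Lists
```

Note that consecutive capitals are split one by one (Sem S T), which matches the output requested in the question.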
