How to decompose twitter hashtags into words?

Time:08-14

I'm trying to decompose Twitter hashtags in order to extract the words that compose them. I'm having trouble finding a regular expression that does this satisfactorily, mainly because of the authors' "excessive creativity" in capitalization.

Some examples:

#itsAHashtag -> ['its', 'a', 'hashtag']
#GlazersOutNOW -> ['glazers', 'out', 'now']
#COVIDIsNotOver -> ['covid', 'is', 'not', 'over']

Is there any library that does this kind of decomposition?

CodePudding user response:

Based upon the samples you provided, this regex should work for you:

(?:[A-Z]+|[a-zA-Z][a-z]+?)(?=[A-Z]|$)

Check this demo

And let me know if this works. I will add explanation if it works well.
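For anyone who wants to try the pattern in Python, here is a minimal sketch. It assumes the pattern is meant to have `+` quantifiers where the page shows stray spaces, and the helper name `split_hashtag` is made up for this example.

```python
import re

# Pattern from the answer above, assuming "+" quantifiers where the page
# shows stray spaces. A run of capitals, or one letter followed by lowercase
# letters, must be followed by a capital letter or the end of the string.
HASHTAG_WORD = re.compile(r"(?:[A-Z]+|[a-zA-Z][a-z]+?)(?=[A-Z]|$)")

def split_hashtag(tag):
    """Split a mixed-case hashtag into a list of lowercase words."""
    # findall returns the full matches because the pattern has no capturing groups.
    return [word.lower() for word in HASHTAG_WORD.findall(tag.lstrip("#"))]

for tag in ("#itsAHashtag", "#GlazersOutNOW", "#COVIDIsNotOver"):
    print(tag, "->", split_hashtag(tag))
```

This reproduces the three examples from the question. It relies entirely on capitalization, so an all-lowercase tag such as #itsahashtag comes back as a single word, and the pattern does not account for digits.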

CodePudding user response:

You could combine a split on capital letters with a set of English words to compare against. The english-words module looks promising.
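A rough sketch of that combination in Python is below. The tiny `VOCAB` set is a hypothetical stand-in for a real word list (for instance one loaded from the english-words module), and `segment` and `decompose` are names invented for this example. The capital-letter split reuses the regex from the other answer, assuming it has `+` quantifiers.

```python
import re

# Hypothetical toy vocabulary for illustration only; a real word list
# would be loaded from a dictionary source instead.
VOCAB = {"its", "a", "hashtag", "glazers", "out", "now",
         "covid", "is", "not", "over"}

# Capital-letter split (regex from the other answer, "+" quantifiers assumed).
CAPS_SPLIT = re.compile(r"(?:[A-Z]+|[a-zA-Z][a-z]+?)(?=[A-Z]|$)")

def segment(text, vocab):
    """Split lowercase text into vocabulary words; return None if impossible."""
    # best[i] is a segmentation of text[:i] using the fewest words, or None.
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            word = text[start:end]
            if best[start] is not None and word in vocab:
                candidate = best[start] + [word]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(text)]

def decompose(tag, vocab=VOCAB):
    """Split on capitals first, then fall back to the vocabulary per chunk."""
    words = []
    for chunk in CAPS_SPLIT.findall(tag.lstrip("#")):
        chunk = chunk.lower()
        if chunk in vocab:
            words.append(chunk)
        else:
            # Keep the chunk unchanged when no dictionary split exists.
            words.extend(segment(chunk, vocab) or [chunk])
    return words

print(decompose("#itsahashtag"))
print(decompose("#GlazersOutnow"))
```

The dictionary pass only runs on chunks that the capital-letter split could not resolve, so well-capitalized tags are unaffected. Preferring the segmentation with the fewest words is one simple heuristic; with a large vocabulary, ambiguous tags may need word frequencies to rank the candidates.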
