I'm trying to decompose Twitter hashtags in order to extract the words that compose them. I'm having trouble finding a regular expression that does this satisfactorily, mainly because of the authors' "excessive creativity" with capitalization.
Some examples:
#itsAHashtag -> ['its', 'a', 'hashtag']
#GlazersOutNOW -> ['glazers', 'out', 'now']
#COVIDIsNotOver -> ['covid', 'is', 'not', 'over']
Is there any library that does this kind of decomposition?
CodePudding user response:
Based on the samples you provided, this regex should work for you:
(?:[A-Z]+|[a-zA-Z][a-z]+?)(?=[A-Z]|$)
Let me know if this works. I will add an explanation if it does.
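For reference, here is a minimal Python sketch of applying the pattern with re.findall. It assumes the quantifiers in the pattern are `+` and `+?`; the helper name split_hashtag is made up for illustration and is not part of any library.

```python
import re

# Assumption: the quantifiers in the pattern are `+` and `+?`.
# First alternative: a run of capitals (e.g. "COVID", "NOW", "A").
# Second alternative: one letter followed by a lazy run of lowercase letters.
# The lookahead ends each match just before the next capital or at the end.
WORD_RE = re.compile(r"(?:[A-Z]+|[a-zA-Z][a-z]+?)(?=[A-Z]|$)")


def split_hashtag(tag: str) -> list[str]:
    """Split a camel-case hashtag into lowercase words (hypothetical helper)."""
    return [word.lower() for word in WORD_RE.findall(tag.lstrip("#"))]


for tag in ("#itsAHashtag", "#GlazersOutNOW", "#COVIDIsNotOver"):
    print(tag, "->", split_hashtag(tag))
```

Note that this relies entirely on capitalization: an all-lowercase tag such as #itsahashtag comes back as a single word.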
CodePudding user response:
You could combine a split on capital letters with a set of English words to check the pieces against. The english-words module looks promising.
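A rough sketch of that idea in Python, under some assumptions: WORDS is a tiny hand-written stand-in for a real vocabulary (a set loaded from a word-list package such as english-words would take its place), and the names CAPS_RE, segment and decompose are made up for this example. The tag is first split on capitals, and any piece that is not a known word is then segmented against the word set.

```python
import re

# Hypothetical stand-in vocabulary; a real English word set would replace it.
WORDS = {"its", "a", "hashtag", "glazers", "out", "now",
         "covid", "is", "not", "over"}

# Capital-letter split: a run of capitals not followed by a lowercase letter,
# or an optional capital followed by lowercase letters.
CAPS_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+")


def segment(text, words):
    """Split text into known words using the fewest pieces, or return None."""
    best = [None] * (len(text) + 1)  # best[i] = best split of text[:i]
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            if best[start] is not None and piece in words:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(text)]


def decompose(tag, words=WORDS):
    """Split on capitals first, then fall back to the word set per chunk."""
    result = []
    for chunk in CAPS_RE.findall(tag.lstrip("#")):
        chunk = chunk.lower()
        if chunk in words:
            result.append(chunk)
        else:
            # Keep the chunk whole if the word set cannot segment it.
            result.extend(segment(chunk, words) or [chunk])
    return result


for tag in ("#itsAHashtag", "#GlazersOutNOW", "#itsahashtag"):
    print(tag, "->", decompose(tag))
```

The word set is what lets an all-lowercase tag be split at all. The quality of the result depends on the vocabulary, and names or slang missing from it are simply left as whole chunks.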