I need to match all words from a tweet which are not URLs (starting with https), hashtags (starting with #) or other characters such as .,;/
I am using
(?!https|[\\t])\b[aA-zZ]
but it is not working as expected.
Is there any other way to extract only words not starting with special characters or URLs?
CodePudding user response:
If a lookbehind assertion is supported, you might use:
(?<!\S)(?!https?:\/\/)\w\S*
Explanation
(?<!\S)
Negative lookbehind, assert a whitespace boundary to the left
(?!https?:\/\/)
Negative lookahead, assert not http:// or https:// directly to the right
\w\S*
Match a single word character followed by optional non-whitespace characters
See a regex demo.
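As a rough illustration of the lookbehind pattern, here is a minimal Python sketch. The question does not say which language is in use, so Python's re module, the helper name extract_words and the sample tweet are all assumptions made for this example. The forward slashes are left unescaped because Python does not need them escaped.

```python
import re

# Lookbehind variant of the pattern from the answer.
# (?<!\S)         whitespace boundary (or start of string) to the left
# (?!https?://)   not followed by http:// or https://
# \w\S*           one word character, then any non-whitespace characters
WORD_RE = re.compile(r"(?<!\S)(?!https?://)\w\S*")


def extract_words(text):
    """Return tokens that start with a word character and are not URLs.

    Hypothetical helper for illustration only.
    """
    return WORD_RE.findall(text)


if __name__ == "__main__":
    tweet = "Loving the new release #python https://example.com so far"
    print(extract_words(tweet))
```

Hashtags are skipped because # is not a word character, and the word after it is preceded by a non-whitespace character. Note that \S* keeps any trailing punctuation, so a token such as "far." would be returned with the dot attached.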
Without a lookbehind, using a capture group instead:
(?:\s|^)(?!https?:\/\/)(\w\S*)
See another regex demo.
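With the capture group variant, the full match also includes the leading whitespace character, so the word has to be read from group 1. A small Python sketch of that, again assuming Python's re module; the function name words_via_group and the sample text are made up for this example:

```python
import re

# Variant without a lookbehind: the leading whitespace (or the start of
# the string) is consumed by a non-capturing group, and the word itself
# is captured in group 1.
PATTERN = re.compile(r"(?:\s|^)(?!https?://)(\w\S*)")


def words_via_group(text):
    """Collect group 1 of every match (hypothetical helper)."""
    return [m.group(1) for m in PATTERN.finditer(text)]


if __name__ == "__main__":
    print(words_via_group("good morning #sun http://t.co/x all"))
```

Reading group 0 instead would give tokens such as " morning" with the leading space still attached, which is why the group is needed here.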