Regex to match all words except URLs

Time:12-23

I need to match all words in a tweet that are not URLs (starting with https), hashtags (starting with #), or other characters such as .,;/

I am using

(?!https|[\\t])\b[aA-zZ] 

but it is not working as expected.

Is there any other way to extract only the words that do not start with special characters or URLs?

CodePudding user response:

If a lookbehind assertion is supported, you might use:

(?<!\S)(?!https?:\/\/)\w\S*

Explanation

  • (?<!\S) Negative lookbehind, assert a whitespace boundary to the left
  • (?!https?:\/\/) Negative lookahead, assert not http:// or https:// directly to the right
  • \w\S* Match a single word character followed by optional non-whitespace characters

See a regex demo.
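
As a rough illustration, here is how the lookbehind pattern could be applied in Python. This is only a sketch: the question does not say which language is in use, so Python's re module is an assumption, and the names WORD_RE and extract_words and the sample tweet are made up for the example.

```python
import re

# Pattern from the answer. The forward slashes do not need escaping in Python.
# (?<!\S)        whitespace boundary to the left (or start of string)
# (?!https?://)  the token must not start with http:// or https://
# \w\S*          a word character followed by any non-whitespace characters
WORD_RE = re.compile(r"(?<!\S)(?!https?://)\w\S*")


def extract_words(text):
    """Return the tokens that start with a word character and are not URLs."""
    return WORD_RE.findall(text)


# Hypothetical sample tweet, for illustration only.
tweet = "Loving the new release #python https://example.com see notes"
print(extract_words(tweet))
```

Hashtags are skipped because # is not a word character, so \w cannot match right after the whitespace boundary. Note that \S* also keeps trailing punctuation, so a token such as "Hi," is returned with its comma.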

Without a lookbehind, using a capture group instead (the word is in group 1):

(?:\s|^)(?!https?:\/\/)(\w\S*)

See another regex demo.
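
The capture group variant can be tried the same way. Again this is a sketch that assumes Python's re module; WORD_GROUP_RE, extract_words_group and the sample tweet are hypothetical names chosen for the example.

```python
import re

# Variant without a lookbehind: the leading whitespace (or start of string)
# is consumed by the non-capturing group, and the word itself is group 1.
WORD_GROUP_RE = re.compile(r"(?:\s|^)(?!https?://)(\w\S*)")


def extract_words_group(text):
    """Return group 1 of every match, i.e. the words without the leading space."""
    # With exactly one capture group, findall returns that group's text.
    return WORD_GROUP_RE.findall(text)


# Hypothetical sample tweet, for illustration only.
tweet = "Loving the new release #python https://example.com see notes"
print(extract_words_group(tweet))
```

Because the full match includes the preceding whitespace character, read the word from group 1 rather than from the whole match.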
