Regex for capturing all the urls in a paragraph except for a specific domain

Time:05-26

I need to capture all the URLs in a paragraph except those from a specific domain or subdomain. For example, in the paragraph below I need to capture all the URLs apart from those on example.com.

"This is a paragraph name.url.com it contains random urls name-dev.url.com name-qa.url.com www.example.com test.example.com http://TestCaSeSensetivEUrl.com http://www.test.com https://www.example.com test.com"

Urls I need to capture

Urls I don't need to capture as below

I have tried the regex below, which uses a negative lookbehind, but it's not working as I need.

/(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?([a-z0-9]+(?<!example)[\-\.]{1}[a-z0-9 ]+(?<!example)\.[a-z]{2,5})/gi

CodePudding user response:

This should be sufficient for your use case, provided each URL is on its own line:

/(https?:\/\/)?[A-Za-z0-9\-\.]*^(?:(?!example).)*$/gm

A great resource for testing is https://regex101.com/. If you paste this regex and your URLs into the site, you can see exactly what it matches and how. Below is a rough explanation of how it works:

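  • Excluding a string is done by the ^(?:(?!example).)*$ segment. The negative lookahead (?!example) is checked before every character, so the match fails as soon as "example" appears anywhere on the line. The ?: makes the parentheses a non-capturing group, since you are not interested in the captured text.
    • The caret (^) here is a start-of-line anchor, not a negation; it only negates when it is the first character inside a character class such as [^abc]. Together with $ and the m flag, it makes the pattern test each line as a whole.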
  • (https?:\/\/)? matches an optional preface of http:// or https://
  • [A-Za-z0-9\-\.]* matches any run of letters (upper and lower case), digits, dashes, and periods.
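
As a rough illustration of how this pattern behaves, here is a minimal JavaScript sketch. It assumes the URLs have already been split onto separate lines, because ^ and $ with the m flag anchor at line boundaries. The names notExample, lines and kept are made up for the example.

```javascript
// Regex from the answer above; with the m flag, ^ and $ match at line boundaries.
const notExample = /(https?:\/\/)?[A-Za-z0-9\-\.]*^(?:(?!example).)*$/gm;

// Assumption: one URL per line (a subset of the URLs from the question).
const lines = [
  "name.url.com",
  "www.example.com",
  "http://www.test.com",
  "https://www.example.com",
  "test.com",
].join("\n");

// An empty line would also produce an (empty) match, so drop empty strings.
const kept = (lines.match(notExample) || []).filter((s) => s.length > 0);

console.log(kept);
// [ 'name.url.com', 'http://www.test.com', 'test.com' ]
```

Any line containing "example" is rejected as a whole, so this approach does not pick URLs out of running text.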

CodePudding user response:

This could be a solution:

^(https?:\/\/)?(?!(?:www\.)?google\.*)([\da-zA-Z.-]+)\.([a-zA-Z\.]{2,6})([\/\w .-]*)*\/?$

In this example, google is the domain excluded from being captured.
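
To see what this pattern accepts, here is a small JavaScript sketch with "example" substituted for "google". Since the pattern is anchored with ^ and $, it is tested against one candidate string at a time; the names allowed, candidates and accepted are assumptions made for the illustration.

```javascript
// Same pattern as above, with "example" in place of "google".
const allowed =
  /^(https?:\/\/)?(?!(?:www\.)?example\.*)([\da-zA-Z.-]+)\.([a-zA-Z\.]{2,6})([\/\w .-]*)*\/?$/;

// Assumption: candidates were already split out of the paragraph.
const candidates = [
  "name.url.com",
  "www.example.com",
  "test.example.com",
  "http://www.test.com",
  "https://www.example.com",
  "test.com",
];

// Keep only the candidates the anchored pattern accepts.
const accepted = candidates.filter((url) => allowed.test(url));

console.log(accepted);
// [ 'name.url.com', 'test.example.com', 'http://www.test.com', 'test.com' ]
```

Note that test.example.com is still accepted: the lookahead only rejects "example" (optionally preceded by "www.") directly after the protocol, so other subdomains would need a broader lookahead.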

CodePudding user response:

I think this regex works for you:

/[a-z-0-9.\/:]+(?<!example)\.com/gi

It matches all the URLs you listed. The character class picks up letters a to z, digits 0 to 9, and the characters ".", "/", "-" and ":", repeated one or more times.

That covers the entire URL. A negative lookbehind then ensures that "example" does not come directly before ".com"; the URL matches only if it does not.
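
A minimal JavaScript sketch of this approach, run against the paragraph from the question, might look like the following. The variable names are illustrative, and lookbehind needs a reasonably recent JavaScript engine.

```javascript
// Paragraph from the question.
const paragraph =
  "This is a paragraph name.url.com it contains random urls name-dev.url.com " +
  "name-qa.url.com www.example.com test.example.com http://TestCaSeSensetivEUrl.com " +
  "http://www.test.com https://www.example.com test.com";

// URL-like characters, then ".com" that is not directly preceded by "example".
const urlPattern = /[a-z-0-9.\/:]+(?<!example)\.com/gi;

// With the g flag, match() returns every full match (or null if there are none).
const urls = paragraph.match(urlPattern) || [];

console.log(urls);
// [ 'name.url.com', 'name-dev.url.com', 'name-qa.url.com',
//   'http://TestCaSeSensetivEUrl.com', 'http://www.test.com', 'test.com' ]
```

Two limits of this pattern are worth keeping in mind: it only finds .com URLs, and it also skips any host whose name ends in "example" right before ".com" (for instance myexample.com).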
