I need to capture all the URLs in a paragraph except those from a specific domain or subdomain. For example, in the paragraph below I need to capture all the URLs apart from those under example.com:
"This is a paragraph name.url.com it contains random urls name-dev.url.com name-qa.url.com www.example.com test.example.com http://TestCaSeSensetivEUrl.com http://www.test.com https://www.example.com test.com"
URLs I need to capture:
- name.url.com
- name-dev.url.com
- name-qa.url.com
- http://TestCaSeSensetivEUrl.com
- http://www.test.com
- test.com
URLs I don't need to capture:
- www.example.com
- test.example.com
- https://www.example.com
I have tried the regex below, which uses negative lookbehind, but it's not working as I need:
/(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?([a-z0-9]+(?<!example)[\-\.]{1}[a-z0-9 ]+(?<!example)\.[a-z]{2,5})/gi
CodePudding user response:
This should be sufficient for your use case:
/(https?:\/\/)?[A-Za-z0-9\-\.]*^(?:(?!example).)*$/gm
A great resource for testing is https://regex101.com/. If you paste this regex into the site along with your URLs, you can see how it works and exactly what it matches. Below is a rough explanation of how this works:
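- (https?:\/\/)? matches an optional http:// or https:// prefix.
- [A-Za-z0-9\-\.]* matches any run of letters (upper and lower case), digits, dashes, and periods.
- ^(?:(?!example).)*$ is the segment that rejects the unwanted string. At each position, (?!example) checks that example does not start there before the . consumes one character, so the segment can only span text that does not contain example.
- The ?: makes the parentheses a non-capturing group, since you are not interested in the group's contents.
- The caret (^) here is an anchor, not a negation: with the m flag it matches at the start of a line, and $ matches at the end of one. A caret negates only as the first character inside square brackets, as in [^abc]. Because of these anchors, the pattern matches a whole line that does not contain example, so it expects one URL per line.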
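A minimal sketch of how this pattern behaves, assuming the URLs are placed one per line. The input list and the variable names are made up for illustration; the regex is the one from this answer, unchanged.

```javascript
// Hypothetical input: some of the question's URLs, one per line.
const urlLines = [
  "name.url.com",
  "www.example.com",
  "http://www.test.com",
  "https://www.example.com",
  "test.com",
].join("\n");

// With the m flag, ^ and $ anchor each line. (?!example). consumes a
// character only when "example" does not start at that position.
const notExampleLine = /(https?:\/\/)?[A-Za-z0-9\-\.]*^(?:(?!example).)*$/gm;

// With the g flag, match() returns every full match, or null if there are none.
const keptLines = urlLines.match(notExampleLine) || [];
console.log(keptLines);
// → ["name.url.com", "http://www.test.com", "test.com"]
```

Run against a single line that holds several URLs, the same pattern returns no matches as soon as one of them contains example, because it accepts or rejects the line as a whole.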
CodePudding user response:
This could be a solution:
^(https?:\/\/)?(?!(?:www\.)?google\.*)([\da-zA-Z.-]+)\.([a-zA-Z\.]{2,6})([\/\w .-]*)*\/?$
Here, for example, google is excluded from being captured.
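A hedged sketch of the same negative-lookahead idea applied to the question's paragraph. The pattern here is a simplified assumption, not this answer's exact regex: it swaps google for example.com, also rejects subdomains, and drops the nested quantifier at the end. Because the pattern is anchored with ^ and $, the paragraph is split on whitespace first and each token is tested on its own.

```javascript
// The paragraph from the question.
const text =
  "This is a paragraph name.url.com it contains random urls " +
  "name-dev.url.com name-qa.url.com www.example.com test.example.com " +
  "http://TestCaSeSensetivEUrl.com http://www.test.com " +
  "https://www.example.com test.com";

// Assumed pattern: optional scheme, then a negative lookahead that rejects
// example.com and any subdomain of it, then a dotted host name and an
// optional path.
const allowedUrl =
  /^(?:https?:\/\/)?(?!(?:[\da-z-]+\.)*example\.com(?![\w.-]))[\da-z-]+(?:\.[\da-z-]+)+(?:\/\S*)?$/i;

// Test one whitespace-separated token at a time and keep those that pass.
const wanted = text.split(/\s+/).filter((token) => allowedUrl.test(token));
console.log(wanted);
// → ["name.url.com", "name-dev.url.com", "name-qa.url.com",
//    "http://TestCaSeSensetivEUrl.com", "http://www.test.com", "test.com"]
```

Plain words such as "paragraph" are dropped because the pattern requires at least one dot in the host name.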
CodePudding user response:
I think this regex works for you:
/[a-z-0-9.\/:]+(?<!example)\.com/gi
It gets all the URLs you listed. It works by matching letters from a to z, digits from 0 to 9, and the characters ".", "/", "-", and ":", repeated one or more times.
That covers the whole URL up to the final ".com". The negative lookbehind then asserts that "example" does not come immediately before ".com"; if it does not, the URL matches the regex.
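As a quick check, the following sketch runs this pattern against the paragraph from the question. The variable names are illustrative, and the regex is written with the + quantifier after the character class, which is what "repeated one or more times" describes.

```javascript
// The paragraph from the question.
const paragraph =
  "This is a paragraph name.url.com it contains random urls " +
  "name-dev.url.com name-qa.url.com www.example.com test.example.com " +
  "http://TestCaSeSensetivEUrl.com http://www.test.com " +
  "https://www.example.com test.com";

// One or more URL characters, then ".com", provided the seven characters
// immediately before ".com" are not "example" (negative lookbehind).
const notExampleCom = /[a-z-0-9.\/:]+(?<!example)\.com/gi;

// With the g flag, match() returns every full match, or null if there are none.
const captured = paragraph.match(notExampleCom) || [];
console.log(captured);
// → ["name.url.com", "name-dev.url.com", "name-qa.url.com",
//    "http://TestCaSeSensetivEUrl.com", "http://www.test.com", "test.com"]
```

Two limits are worth keeping in mind: the pattern only recognises hosts ending in .com, and because the lookbehind checks just the text right before ".com", it would also skip a host such as notexample.com. Lookbehind was added to JavaScript in ES2018, so older engines may reject the pattern.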