Regex for URL without path

Time:01-23

I know there are many solutions, articles, and libraries for this, but I couldn't find one that matches my case. I'm trying to write a regex to extract a URL (representing a website) from a piece of text (a person's email signature). The URL can take several forms:

  • It may or may not start with http(s)://
  • It may or may not contain www.
  • It may have a multi-part TLD such as "test.com.cn"

Here are some examples:

www.test.com
https://test.com.cn
http://www.test.com.cn
test.com
test.com.cn

I've come up with the following regex:

(https?://)?(www\.)?\w{2,}\.[a-zA-Z]{2,}(\.[a-zA-Z]{2,})?$

But there are two main problems with it, because the signature can also contain an email address:

  1. It (wrongly) captures the domain and TLD of email addresses like this one: [email protected]
  2. It doesn't capture URLs in the middle of a line, and if I remove the $ at the end, it captures the name.surname part of the email address above
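
Both problems can be reproduced with a short Python sketch. The sample lines here are hypothetical stand-ins (the email address in the question is obfuscated), and `first_match` is just a helper name chosen for illustration:

```python
import re

# The question's pattern, anchored at the end of the line with `$`.
ANCHORED = re.compile(
    r"(https?://)?(www\.)?\w{2,}\.[a-zA-Z]{2,}(\.[a-zA-Z]{2,})?$"
)
# The same pattern with the trailing `$` removed.
UNANCHORED = re.compile(
    r"(https?://)?(www\.)?\w{2,}\.[a-zA-Z]{2,}(\.[a-zA-Z]{2,})?"
)


def first_match(pattern, line):
    """Return the text of the first match in `line`, or None."""
    m = pattern.search(line)
    return m.group(0) if m else None


# Hypothetical sample lines (assumptions, not taken from the question).
email_line = "name.surname@test2.com"
mid_line = "Visit www.test.com today"

print(first_match(ANCHORED, email_line))    # problem 1: part of the email matches
print(first_match(ANCHORED, mid_line))      # problem 2: URL is not at the end of the line
print(first_match(UNANCHORED, email_line))  # without `$`: the local part matches
```

With these inputs the anchored pattern returns test2.com for the email line and nothing for the mid-line URL, while the unanchored pattern returns name.surname.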

For (1), I tried a negative lookbehind by adding (?<!@) to the beginning of the pattern. The problem is that it now captures est2.com instead of not matching at all.
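
The shifted match happens because the lookbehind only rejects the one position directly after the @; the engine then retries one character later, where the preceding character is no longer @. A minimal sketch, assuming a hypothetical address of the form name.surname@test2.com:

```python
import re

# The question's pattern with the negative lookbehind (?<!@) prepended.
WITH_LOOKBEHIND = re.compile(
    r"(?<!@)(https?://)?(www\.)?\w{2,}\.[a-zA-Z]{2,}(\.[a-zA-Z]{2,})?$"
)

# Hypothetical email address (an assumption; the original is obfuscated).
email_line = "name.surname@test2.com"

m = WITH_LOOKBEHIND.search(email_line)
shifted = m.group(0) if m else None

# The match cannot start at the "t" right after "@", so it starts at "e".
print(shifted)
```

Nothing in the pattern requires the match to begin at the start of a word, which is why skipping a single character is enough to satisfy it.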

CodePudding user response:

I think you could use \b (a word boundary) instead of $ (and at the beginning as well), and exclude @ with a negative lookbehind and a negative lookahead:

(?<!@|\.|-)\b(https?://)?(www\.)?\w{2,}\.[a-zA-Z]{2,}(\.[a-zA-Z]{2,})?\b(?!@|\.|-)

Edit: also exclude the dot (and any other non-alphanumeric characters likely to occur in a URL or email address) in your lookarounds, to avoid matching name.middlename in [email protected] or com.cn in [email protected]. See this answer for the list of characters.
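
As a quick check, the pattern from this answer can be run over a whole signature in Python. This is only a sketch: the signature text and the `find_urls` helper are made up for illustration, and `finditer` is used because `findall` would return the capture groups rather than the full match.

```python
import re

# The answer's pattern: word boundaries plus lookarounds excluding "@", "." and "-".
URL_PATTERN = re.compile(
    r"(?<!@|\.|-)\b(https?://)?(www\.)?\w{2,}\.[a-zA-Z]{2,}(\.[a-zA-Z]{2,})?\b(?!@|\.|-)"
)


def find_urls(text):
    """Return every full match of URL_PATTERN in `text`, in order."""
    return [m.group(0) for m in URL_PATTERN.finditer(text)]


# Hypothetical signature (an assumption, not from the question).
signature = (
    "Jane Doe\n"
    "name.surname@test2.com\n"
    "www.test.com | https://test.com.cn"
)

urls = find_urls(signature)
print(urls)
```

For this signature the email line yields no match and the two URLs on the last line are both returned. One consequence of putting the dot in the lookahead: a URL immediately followed by a sentence-ending period, as in "See test.com.", is not matched at all.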
