I know there are many solutions, articles and libraries for this, but I couldn't find one that matches my case. I'm trying to write a regex to extract a URL (which represents the website) from a text (a person's email signature). It has to handle multiple cases:
- Could contain http(s)://, or not
- Could contain www., or not
- Could have a multi-part TLD such as "test.com.cn"
Here are some examples:
www.test.com
https://test.com.cn
http://www.test.com.cn
test.com
test.com.cn
I've come up with the following regex:
(https?://)?(www\.)?\w{2,}\.[a-zA-Z]{2,}(\.[a-zA-Z]{2,})?$
But there are two main problems with this, because the signature can contain an email address:
- It (wrongly) captures the domain of email addresses like this one: [email protected]
- It doesn't capture URLs in the middle of a line, and if I remove the $ sign at the end, it captures the name.surname part of that email address
For (1), I tried using a negative lookbehind, adding (?<!@) to the beginning. The problem is that it now captures est2.com instead of not matching at all.
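The three behaviours described above can be reproduced with a small Python sketch. The address name.surname@test2.com is a made-up stand-in (the real one is not shown here), and first_match is a hypothetical helper written only for this illustration:

```python
import re

# The regex from the question, plus the two variants that were tried.
ORIGINAL = r"(https?://)?(www\.)?\w{2,}\.[a-zA-Z]{2,}(\.[a-zA-Z]{2,})?$"
NO_ANCHOR = ORIGINAL[:-1]              # same pattern without the trailing $
WITH_LOOKBEHIND = r"(?<!@)" + ORIGINAL  # (?<!@) added at the beginning

# Assumed sample address, used only to illustrate the problem.
SAMPLE = "name.surname@test2.com"


def first_match(pattern, text):
    """Return the text of the first match, or None if nothing matches."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


print(first_match(ORIGINAL, SAMPLE))         # the email's domain
print(first_match(NO_ANCHOR, SAMPLE))        # the local part of the email
print(first_match(WITH_LOOKBEHIND, SAMPLE))  # the domain minus its first letter
```

The lookbehind only rejects the one position directly after the @, so the engine simply retries one character later and still finds a match.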
CodePudding user response:
I think you could use \b (word boundary) instead of $ (and at the beginning as well) and exclude @ in a negative lookbehind and lookahead:
(?<!@|\.|-)\b(https?://)?(www\.)?\w{2,}\.[a-zA-Z]{2,}(\.[a-zA-Z]{2,})?\b(?!@|\.|-)
Edit: exclude the dot (and all non-alphanumeric characters likely to occur in a URL/email address) in your lookarounds to avoid matching name.middlename in [email protected] or com.cn in [email protected]. See this answer for the list of characters.
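
As a rough check, the pattern with the lookarounds can be tried in Python. This is only a sketch: the sample strings and the address name.surname@test2.com are made up, and extract_urls is a hypothetical helper name. It uses finditer with group(0) because findall would return the contents of the capture groups rather than the whole URL.

```python
import re

# Pattern from the answer: word boundaries plus lookarounds that reject
# candidates touching "@", "." or "-".
URL_RE = re.compile(
    r"(?<!@|\.|-)\b(https?://)?(www\.)?\w{2,}\.[a-zA-Z]{2,}"
    r"(\.[a-zA-Z]{2,})?\b(?!@|\.|-)"
)


def extract_urls(text):
    """Return every full match of URL_RE found in text."""
    return [m.group(0) for m in URL_RE.finditer(text)]


# Assumed signature text, for illustration only.
signature = "John Doe\nname.surname@test2.com\nhttp://www.test.com.cn | test.com"
print(extract_urls(signature))
```

Note that the trailing lookahead also rejects a URL that is immediately followed by a sentence-ending period, so text like "see test.com." would need extra handling.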