I am trying to match the domain (example) and the extension (.com, a.k.a. the top-level domain) in a text that can include links, but I fail completely: my patterns match the subdomain too, and sometimes treat the domain as the extension.
Goal:
https://www.subdomain.example.com/folder/folder -> example.com
example.com/folder/folder -> example.com
www.subdomain.example.com/folder/folder -> example.com
example.com -> example.com
www.example.com -> example.com
subdomain.example.com -> example.com
Attempt 1:
(?:(?:www?).)?\b((xn--)?[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}\b
Attempt 2:
(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-&?=%.]+
CodePudding user response:
Something like this may work or be a starter: https://regex101.com/r/1UMjML/1 (updated regex slightly)
regex: (?<=https?://)(?:\w+\.)+(?<domain>\w+\.\w+)[/\s$]
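For reference, here is a rough Python adaptation of the same idea (skip the scheme and any leading labels, capture the last two labels). It is only a sketch under a few assumptions: Python's re module rejects a variable-width lookbehind such as (?<=https?://) and spells named groups (?P<name>...), so the scheme is consumed by an optional group instead. The names TOKEN_RE, registered_domain and domains_in_text are made up for illustration.

```python
import re

# Optional scheme, then any leading labels (www., subdomain., ...),
# then the last two labels captured as "domain". The lookahead requires
# the host to end at a path/port/query/fragment delimiter or end of string.
TOKEN_RE = re.compile(
    r"^(?:https?://)?(?:[\w-]+\.)*(?P<domain>[\w-]+\.[a-z]{2,})(?=[/:?#]|$)",
    re.IGNORECASE,
)

def registered_domain(token):
    """Return the last two host labels of a URL-like token, or None."""
    m = TOKEN_RE.match(token)
    return m.group("domain").lower() if m else None

def domains_in_text(text):
    """Check each whitespace-separated token, ignoring surrounding punctuation."""
    found = []
    for token in text.split():
        domain = registered_domain(token.strip(".,;:!?()<>\"'"))
        if domain:
            found.append(domain)
    return found
```

Note that this treats the last two labels as the domain, so a host such as example.co.uk would come back as co.uk. Handling multi-part extensions correctly generally needs a list of known suffixes (for example the Public Suffix List) rather than a regex alone.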
CodePudding user response:
A simple solution is to match anything followed by a tld:
\w+\.com
You can then make it more explicit by padding the beginning and end with whatever you want to match on e.g.:
(?:https:\/\/.*?)?(\w+\.com)
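A minimal Python sketch of this "label followed by a known TLD" approach, extended to scan free text. The TLDS tuple and the find_domains name are assumptions for illustration; the list would need to hold whatever extensions you actually care about.

```python
import re

# Hypothetical list of extensions to accept; extend as needed.
TLDS = ("com", "org", "net")

# One label, a dot, then one of the known extensions. The trailing \b stops
# "example.community" from matching, and the negative lookahead skips a
# candidate that is itself followed by another label (e.g. "com.example.org").
DOMAIN_RE = re.compile(
    r"\b([a-z0-9-]+\.(?:" + "|".join(TLDS) + r"))\b(?!\.[a-z0-9])",
    re.IGNORECASE,
)

def find_domains(text):
    """Return every label.tld pair found in text, lowercased."""
    return [m.group(1).lower() for m in DOMAIN_RE.finditer(text)]
```

Because the pattern only accepts the label directly in front of the extension, www. and subdomain. prefixes are skipped without having to match them explicitly:

```python
find_domains("See https://www.subdomain.example.com/folder and foo.org.")
# ['example.com', 'foo.org']
```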