I want to extract the domain name from a URL, which may also include wildcard characters. Apart from the wildcard, it should be a valid domain.
Acceptable Domains
https://*.google.com => *.google.com
http://*.google.com => *.google.com
*google.com => *google.com
google.com => google.com
any-google.com => any-google.com
www.google.com => www.google.com
https://google.com/something => google.com
google.com/something => google.com
google.com/something?a=23&b=3 => google.com
http://google.com/something?a=23&b=3 => google.com
google.com/something?a=23&b=3#some => google.com
https://google.com/something?a=23&b=3#some => google.com
Non-Acceptable Domains
http://**.google.com
*.*.google.com
google.*com
goo**le.com
google.com*
google--.com
google..com
google-s.com
goolge/$#$
<all invalid URL>
Note: In the examples above, google.com is used only for illustration; it could be any domain.
I tried using C#'s System.Uri, but it fails when the input contains a wildcard character (*). Regex-based solutions also seem to give too many false positives or false negatives.
private static string ExtractDomainFromUrl(string url)
{
    if (Uri.IsWellFormedUriString(url, UriKind.Absolute))
    {
        return new Uri(url, UriKind.Absolute).Host;
    }
    return null;
}
The above solution fails when the input URL doesn't start with http or https. It also fails when the input contains a wildcard character (e.g. *.google.com).
Answer:
There are multiple questions to be answered here. First, how do you tell a URL from a domain?
var uriRel = new Uri(url, UriKind.RelativeOrAbsolute);
// A relative result is treated here as a URL that is only missing its scheme.
if (!uriRel.IsAbsoluteUri) url = "http://" + url;
I am not sure it's good practice to treat a relative URI as if it were just missing the scheme; that depends on how you obtain the URI, but I assume it's fine in your case. You might also need to handle input starting with // and other cases where the input is parsed as relative but does not look like a domain.
Next, how do you allow the * character? You cannot, but you can certainly replace it!
// Find a placeholder ("w0", "w1", ...) that does not occur in the input.
string replacement;
for (int i = 0; ; i++)
{
    replacement = "w" + i;
    if (!url.Contains(replacement))
    {
        break;
    }
}
var uriObj = new Uri(url.Replace("*", replacement), UriKind.Absolute);
var host = uriObj.IdnHost.Replace(replacement, "*");
This simply finds the first URI-valid string that is not contained in the input and uses it to replace the * before parsing, then swaps it back afterwards.
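Putting the scheme check and the placeholder swap together, a minimal sketch could look like the following. WildcardHost and ExtractHost are hypothetical names, not part of System.Uri. The sketch assumes that input without a scheme should be treated as http, and it uses Uri.TryCreate instead of the constructor so that malformed input yields null rather than an exception. The placeholder search is case-insensitive here because Uri lowercases the host, so an uppercase occurrence in the input could otherwise be turned into a * by mistake.

```csharp
using System;

public static class WildcardHost
{
    // Hypothetical helper: returns the host with '*' preserved, or null if parsing fails.
    public static string ExtractHost(string input)
    {
        if (string.IsNullOrWhiteSpace(input)) return null;
        string url = input.Trim();

        // Assumption: input without a scheme is treated as http.
        if (url.StartsWith("//", StringComparison.Ordinal)) url = "http:" + url;
        else if (!url.Contains("://")) url = "http://" + url;

        // Find a placeholder ("w0", "w1", ...) that does not occur in the input,
        // ignoring case because Uri lowercases the host.
        string placeholder;
        for (int i = 0; ; i++)
        {
            placeholder = "w" + i;
            if (url.IndexOf(placeholder, StringComparison.OrdinalIgnoreCase) < 0) break;
        }

        // TryCreate returns false instead of throwing UriFormatException.
        if (!Uri.TryCreate(url.Replace("*", placeholder), UriKind.Absolute, out Uri uri))
            return null;

        // Swap the placeholder back into the parsed host.
        return uri.IdnHost.Replace(placeholder, "*");
    }
}
```

With this sketch, ExtractHost("https://*.google.com") gives "*.google.com" and ExtractHost("google.com/something?a=23&b=3#some") gives "google.com". It only extracts the host; it does not decide whether the wildcard placement is acceptable.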
The last question is how to validate the wildcarded domain if you successfully obtained it. You didn't specify what the actual rules are, so I suppose you intend to implement that yourself.
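As a starting point, here is a sketch of one possible rule set, inferred only from the examples in the question, so every rule in it is an assumption: at most one *, allowed only as the very first character of the host (either "*." or "*" directly followed by a label); at least two labels; each label made of ASCII letters, digits and hyphens, and not starting or ending with a hyphen. WildcardDomainRules and IsValid are hypothetical names. Note that this sketch accepts google-s.com, which is an ordinary hostname; if that entry in the question is intentional, an extra rule would be needed.

```csharp
using System;
using System.Linq;

public static class WildcardDomainRules
{
    // Hypothetical validator for a host that may start with a single '*'.
    public static bool IsValid(string host)
    {
        if (string.IsNullOrEmpty(host)) return false;

        // Strip one leading '*' and, if present, the dot that follows it.
        string rest = host;
        if (rest[0] == '*')
        {
            rest = rest.Substring(1);
            if (rest.StartsWith(".", StringComparison.Ordinal)) rest = rest.Substring(1);
        }

        // Whatever remains must be a plain hostname; any further '*' fails IsLabel.
        string[] labels = rest.Split('.');
        return labels.Length >= 2 && labels.All(IsLabel);
    }

    private static bool IsLabel(string label)
    {
        if (label.Length == 0 || label.Length > 63) return false;
        if (label[0] == '-' || label[label.Length - 1] == '-') return false;
        return label.All(c => (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                           || (c >= '0' && c <= '9') || c == '-');
    }
}
```

Under these assumed rules, *.google.com, *google.com and any-google.com pass, while **.google.com, *.*.google.com, google.*com, goo**le.com, google.com*, google--.com and google..com are rejected.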
In all cases, don't forget to catch the UriFormatException.