C# Extract wildcard domain from Url

I want to extract the domain name from a URL, which may also include a wildcard character. Apart from the wildcard, it should be a valid domain.

Acceptable Domains

https://*.google.com   => *.google.com
http://*.google.com    => *.google.com
*google.com            => *google.com
google.com             => google.com
any-google.com         => any-google.com
www.google.com         => www.google.com
https://google.com/something                => google.com
google.com/something                        => google.com
google.com/something?a=23&b=3               => google.com
http://google.com/something?a=23&b=3        => google.com
google.com/something?a=23&b=3#some          => google.com
https://google.com/something?a=23&b=3#some  => google.com

Non-Acceptable Domain

http://**.google.com
*.*.google.com
google.*com
goo**le.com
google.com*
google--.com
google..com
google-s.com
goolge/$#$
<all invalid URL>

Note: In the above examples, google is used only for illustration. It could be any domain.

I tried using C# System.Uri, but it fails when there is a wildcard character (*). Even RegExp-based solutions seem to give false positives or false negatives.

private static string ExtractDomainFromUrl(string url)
{
    if (Uri.IsWellFormedUriString(url, UriKind.Absolute))
    {
        return new Uri(url, UriKind.Absolute).Host;
    }

    return null;
}

The above solution fails when the input URL doesn't start with http:// or https://. It also fails when the input contains a wildcard character (e.g. *.google.com).

CodePudding user response：

There are multiple questions to be answered here. First, how do you tell a URL from a domain?

var uriRel = new Uri(url, UriKind.RelativeOrAbsolute);
if (!uriRel.IsAbsoluteUri) url = "http://" + url;

I am not sure whether it is good practice to treat a relative URI as if it were merely missing the scheme. That depends on how you obtain such a URI, but I assume it's fine in your case. You might also need to handle input starting with // and other cases where the input gets parsed as relative but does not look like a domain.
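As a rough sketch of the scheme check, including the // case: the helper name EnsureScheme and the choice of http as the default scheme are assumptions, not part of the question. It also assumes any * has already been swapped for a placeholder, since an absolute URL whose host contains * may fail to parse at all.

```csharp
using System;

static class UrlNormalizer
{
    // Hypothetical helper: returns the input with a scheme, or null if it cannot be parsed.
    public static string EnsureScheme(string url)
    {
        // Protocol-relative input such as "//google.com/x" only lacks the scheme name.
        if (url.StartsWith("//", StringComparison.Ordinal))
        {
            return "http:" + url;
        }

        // TryCreate avoids an exception for input that Uri rejects outright.
        if (!Uri.TryCreate(url, UriKind.RelativeOrAbsolute, out var parsed))
        {
            return null;
        }

        // A relative result is treated as "scheme missing" (assumption from the text above).
        return parsed.IsAbsoluteUri ? url : "http://" + url;
    }
}
```

Input such as google.com:8080 can be parsed as an absolute URI with an unusual scheme, so this check alone may not be enough for hosts with ports.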

Next, how do you allow the * character? You cannot, but you can certainly replace it!

string replacement;
for (int i = 0; ; i++)
{
    replacement = "w" + i;
    if (!url.Contains(replacement))
    {
        break;
    }
}
var uriObj = new Uri(url.Replace("*", replacement), UriKind.Absolute);

var host = uriObj.IdnHost.Replace(replacement, "*");

This just tries to find the first URI-valid string that is not contained in the input, and uses it when replacing the * both ways.

The last question is how to validate the wildcarded domain if you successfully obtained it. You didn't specify what the actual rules are, so I suppose you intend to implement that yourself.
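Since the rules are not specified, the following is only one possible reading of the question's examples, not a definitive validator. It assumes a single optional wildcard at the very start (either "*." or "*" fused to the first label), at least two labels, and labels made of letters, digits and hyphens that neither start nor end with a hyphen. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class WildcardDomainRules
{
    // One label: 1 to 63 characters, alphanumeric at both ends, hyphens allowed inside.
    private const string Label = "[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?";

    // Optional leading "*" or "*.", then two or more dot-separated labels.
    private static readonly Regex Pattern = new Regex(
        @"^(\*\.?)?" + Label + @"(\." + Label + @")+\z",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Hypothetical helper: true when the host matches the assumed rules.
    public static bool IsValid(string host)
    {
        return !string.IsNullOrEmpty(host) && Pattern.IsMatch(host);
    }
}
```

Under these assumptions *.google.com and *google.com pass, while **.google.com, google.*com, google.com*, google--.com and google..com are rejected. The sketch accepts google-s.com, which the question lists as non-acceptable; the rule behind that entry is not stated, so it would need an extra check.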

In all cases, don't forget to catch the UriFormatException.
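Put together, the steps might look like the sketch below. The name TryExtract and the null-on-failure convention are assumptions. Unlike the snippets above, it substitutes the placeholder before the scheme check, because an absolute URL with * in the host may be rejected by Uri, and it searches for the placeholder case-insensitively because Uri lower-cases the host.

```csharp
using System;

static class WildcardHost
{
    // Hypothetical helper: returns the host with "*" restored, or null on failure.
    public static string TryExtract(string url)
    {
        if (string.IsNullOrWhiteSpace(url))
        {
            return null;
        }

        // Find a placeholder ("w0", "w1", ...) that does not occur in the input.
        string placeholder;
        for (int i = 0; ; i++)
        {
            placeholder = "w" + i;
            if (url.IndexOf(placeholder, StringComparison.OrdinalIgnoreCase) < 0)
            {
                break;
            }
        }

        string candidate = url.Trim().Replace("*", placeholder);

        try
        {
            // Treat a relative URI as one that is only missing its scheme.
            if (!new Uri(candidate, UriKind.RelativeOrAbsolute).IsAbsoluteUri)
            {
                candidate = "http://" + candidate;
            }

            var uri = new Uri(candidate, UriKind.Absolute);
            return uri.IdnHost.Replace(placeholder, "*");
        }
        catch (UriFormatException)
        {
            // Input that Uri cannot parse is reported as "no domain".
            return null;
        }
    }
}
```

For example, TryExtract("https://*.google.com") gives "*.google.com" and TryExtract("google.com/something?a=23&b=3#some") gives "google.com". The result still needs to be validated, since an input like http://**.google.com is extracted as **.google.com rather than rejected.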