Get unique domain names from a list of domains

I have a list that can contain the same domain written in a variety of forms:

['website.com',
'example.com',
'www.example.com',
'http://example.com',
'http://www.example.com',
'https://example..com'
]

How can I use a regex or plain Python to get a list containing only:

['website.com', 'example.com']

As you can see, all of these variants of example.com and www.example.com refer to the same domain, so I don’t need the different forms for my work.

Thanks

CodePudding user response：

Loop over each item in the list and apply a custom regex pattern that strips the scheme and the leading www. from each URL, then keep only the unique results.
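The loop-and-regex idea above could be sketched roughly as follows. The function name unique_domains and the pattern itself are illustrative assumptions, not something given in the answer, and the sketch only covers an optional http/https scheme and an optional leading www.

```python
import re

# Assumed pattern: an optional http:// or https:// followed by an optional www.
_PREFIX = re.compile(r'^(?:https?://)?(?:www\.)?', re.IGNORECASE)

def unique_domains(items):
    """Return unique domains in first-seen order (hypothetical helper)."""
    seen = {}
    for item in items:
        # Strip the scheme and the leading 'www.' in one pass
        domain = _PREFIX.sub('', item.strip(), count=1)
        # Drop any path, query string, or fragment after the host
        domain = re.split(r'[/?#]', domain, maxsplit=1)[0].lower()
        seen.setdefault(domain, None)  # dict keys keep insertion order
    return list(seen)

# The question's list, with the 'example..com' typo corrected
urls = ['website.com', 'example.com', 'www.example.com',
        'http://example.com', 'http://www.example.com', 'https://example.com']
print(unique_domains(urls))  # ['website.com', 'example.com']
```

Note that a malformed entry such as 'https://example..com' is not repaired by this pattern and would show up as a separate item.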

CodePudding user response：

urlparse from urllib.parse is ideal for this. For example:

from urllib.parse import urlparse

URLS = ['website.com',
'example.com',
'www.example.com',
'http://example.com',
'http://www.example.com',
'https://example.com'
]

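set_ = set()  # a set keeps only one copy of each domain

for url in URLS:
    uri = urlparse(url)
    # Without a scheme, urlparse puts the host in .path instead of .netloc
    dom = uri.netloc or uri.path
    # Drop a leading 'www.' so both forms map to the same domain
    set_.add(dom if not dom.startswith('www.') else dom[4:])

print(list(set_))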

Output (the order may vary, because sets are unordered):

['website.com', 'example.com']
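If the list might also contain paths, ports, or mixed-case hosts, one possible variation is to make sure urlparse always sees a network location and then read its hostname attribute. This is only a sketch: the helper name domain_of is an assumption, and prepending '//' to scheme-less entries relies on urlparse recognising a netloc only when it is introduced by '//'.

```python
from urllib.parse import urlparse

def domain_of(url):
    """Return the host of url without port or leading 'www.' (hypothetical helper)."""
    # Entries without a scheme get '//' prepended so the host lands in netloc
    if '://' not in url:
        url = '//' + url
    # .hostname is lower-cased and excludes any port or credentials
    host = urlparse(url).hostname or ''
    return host[4:] if host.startswith('www.') else host

print(domain_of('HTTP://WWW.Example.com:8080/path?q=1'))  # example.com
print(domain_of('www.example.com/about'))                 # example.com
```

Collecting domain_of(url) for each url into a set then gives the unique domains, as in the loop above.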

CodePudding user response：

A simple pure-Python approach that handles the URL forms in your example (but not more complex ones, such as URLs with paths or ports):

Code

def base_urls(lst):
    """Return the unique base domains from a list of URLs."""

    def base_url(target):
        """Strip the scheme and a leading 'www.' from a single URL."""

        # Order matters: the scheme must be removed before 'www.' can match
        prefixes = ['https://', 'http://', 'www.']

        # Remove each prefix that is present
        for prefix in prefixes:
            if target.startswith(prefix):
                target = target[len(prefix):]  # slice off the prefix

        return target

    # A set keeps one copy of each base domain; convert it back to a list
    return list(set(base_url(target) for target in lst))

Test

yourlist = ['example.com',
'www.example.com',
'http://example.com',
'http://www.example.com',
'https://example.com',                # correction of OP typo
'www.website.com'                     # added to OP list
]

print(base_urls(yourlist))

Output (the order may vary, because sets are unordered)

['website.com', 'example.com']
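Because a set does not preserve order, the result above can come back in either order. If a stable, first-seen order matters, a possible variation is to deduplicate with dict.fromkeys instead. The name ordered_base_urls is an assumption for illustration; base_url repeats the prefix-stripping logic from the answer.

```python
def base_url(target):
    """Strip the scheme and a leading 'www.' from a single URL."""
    for prefix in ('https://', 'http://', 'www.'):
        if target.startswith(prefix):
            target = target[len(prefix):]
    return target

def ordered_base_urls(lst):
    """Unique base domains in first-seen order (hypothetical helper)."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(base_url(t) for t in lst))

yourlist = ['example.com', 'www.example.com', 'http://example.com',
            'http://www.example.com', 'https://example.com', 'www.website.com']
print(ordered_base_urls(yourlist))  # ['example.com', 'website.com']
```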