Get unique domain names from a list of domains

Time:07-04

I have a list that can contain the same domain written in a variety of ways:

['website.com',
'example.com',
'www.example.com',
'http://example.com',
'http://www.example.com',
'https://example..com'
]

How can I, with a regex or plain Python, get a list containing only:

['website.com', 'example.com']

As you can see, all the variants of example.com and www.example.com refer to the same domain name, so I don't need every variant for my work.

Thanks

CodePudding user response:

Loop over each item in the list and apply a custom URL-filtering regex pattern to extract the desired items.
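
A minimal sketch of that idea follows. The pattern and the names DOMAIN_RE and unique_domains are illustrative assumptions, not something given in the question. It assumes each item is at most a scheme, an optional www. prefix and a host, and it uses the corrected https://example.com rather than the typo from the question.

```python
import re

# Hypothetical pattern: optional http:// or https:// scheme, optional 'www.',
# then the host, captured up to the first '/', ':', '?' or '#'.
DOMAIN_RE = re.compile(r'^(?:https?://)?(?:www\.)?([^/:?#]+)', re.IGNORECASE)


def unique_domains(urls):
    """Return the distinct domains in order of first appearance."""
    seen = {}  # dict keys keep insertion order (Python 3.7+)
    for url in urls:
        match = DOMAIN_RE.match(url.strip())
        if match:
            seen.setdefault(match.group(1).lower(), None)
    return list(seen)


urls = ['website.com',
        'example.com',
        'www.example.com',
        'http://example.com',
        'http://www.example.com',
        'https://example.com']

print(unique_domains(urls))  # ['website.com', 'example.com']
```

Using a dict instead of a set keeps the result in the order the domains first appear, which makes the output reproducible.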

CodePudding user response:

urlparse from urllib.parse is ideal for this. For example:

from urllib.parse import urlparse

URLS = ['website.com',
'example.com',
'www.example.com',
'http://example.com',
'http://www.example.com',
'https://example.com'
]

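set_ = set()

for url in URLS:
    uri = urlparse(url)
    # Without a scheme, urlparse leaves netloc empty and puts the text in path
    dom = uri.netloc or uri.path
    # Drop a leading 'www.' so both spellings map to the same domain
    set_.add(dom if not dom.startswith('www.') else dom[4:])

print(list(set_))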

Output (the order may vary, because sets are unordered):

['website.com', 'example.com']
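
If the list may also contain paths, ports or mixed case, a variant along the following lines might help. This is a sketch under assumptions: the helper names domain_of and unique_domains_ordered are made up for illustration. It prepends '//' to scheme-less items so that urlparse fills in netloc, then reads the hostname attribute, which is lower-cased and excludes any port.

```python
from urllib.parse import urlparse


def domain_of(url):
    """Return the host of url without a leading 'www.' (hypothetical helper)."""
    # A bare 'example.com/page' is parsed as a path; '//' marks it as a netloc.
    if '://' not in url:
        url = '//' + url
    host = urlparse(url).hostname or ''  # hostname is lower-cased, port removed
    return host[4:] if host.startswith('www.') else host


def unique_domains_ordered(urls):
    """Return the distinct domains, keeping the order of first appearance."""
    hosts = (domain_of(u) for u in urls)
    return list(dict.fromkeys(h for h in hosts if h))


URLS = ['website.com',
        'example.com',
        'www.example.com',
        'http://example.com',
        'http://www.example.com',
        'https://example.com']

print(unique_domains_ordered(URLS))  # ['website.com', 'example.com']
```

Here dict.fromkeys removes duplicates while preserving order, so the printed list is the same on every run.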

CodePudding user response:

Simple Python that handles your example URLs (but not more complex ones, such as URLs with paths or ports):

Code

def base_urls(lst):
    """Return the unique base URLs from a list of URLs."""

    def base_url(target):
        """Return a single URL with its scheme and 'www.' prefix removed."""

        # Order matters: strip the scheme first, then 'www.'
        prefixes = ['https://', 'http://', 'www.']

        # Remove prefixes from the URL
        for prefix in prefixes:
            if target.startswith(prefix):
                target = target[len(prefix):]  # remove prefix by slicing past it

        return target

    # Find the unique base URLs by converting to a set,
    # then convert the set back to a list
    return list(set(base_url(target) for target in lst))


Test

yourlist = ['example.com',
'www.example.com',
'http://example.com',
'http://www.example.com',
'https://example.com',                # correction of OP typo
'www.website.com'                     # added to OP list
]

print(base_urls(yourlist))

Output (the order may vary)

['website.com', 'example.com']