I have a list which can contain domains in a variety of ways:
['website.com',
 'example.com',
 'www.example.com',
 'http://example.com',
 'http://www.example.com',
 'https://example..com'
]
How can I, with a regex or in plain Python, get a list containing only:
['website.com', 'example.com']
As you can see, example.com, www.example.com and the other variants all refer to the same domain name, so I don't need the different variants for my work.
Thanks
CodePudding user response:
Loop over each item in the list and apply a regex that strips the scheme (http:// or https://) and a leading www., then keep only the unique results.
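A minimal sketch of that regex approach, assuming the only variations are an optional http:// or https:// scheme and an optional www. prefix, as in the question. The names PREFIX_RE, strip_prefix and unique_domains are made up for illustration.

```python
import re

# Optional scheme, then optional "www.", anchored at the start of the string.
# Assumption: nothing else (paths, ports, user info) needs to be handled.
PREFIX_RE = re.compile(r'^(?:https?://)?(?:www\.)?', re.IGNORECASE)

def strip_prefix(url):
    """Return url without a leading scheme and 'www.' prefix."""
    return PREFIX_RE.sub('', url, count=1)

def unique_domains(urls):
    """Strip prefixes and drop duplicates, keeping first-seen order."""
    # dict.fromkeys preserves insertion order, unlike set()
    return list(dict.fromkeys(strip_prefix(u) for u in urls))

urls = ['website.com', 'example.com', 'www.example.com',
        'http://example.com', 'http://www.example.com', 'https://example.com']
print(unique_domains(urls))  # ['website.com', 'example.com']
```

Using dict.fromkeys instead of a set is a design choice here: it keeps the domains in the order they first appear, so the output is stable between runs.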
CodePudding user response:
urlparse from urllib.parse is ideal for this. For example:
from urllib.parse import urlparse

URLS = ['website.com',
        'example.com',
        'www.example.com',
        'http://example.com',
        'http://www.example.com',
        'https://example.com'
        ]

set_ = set()
for url in URLS:
    uri = urlparse(url)
    # Without a scheme, urlparse puts the whole string in .path, not .netloc
    dom = uri.netloc or uri.path
    # Drop a leading 'www.' so both spellings map to the same entry
    set_.add(dom if not dom.startswith('www.') else dom[4:])
print(list(set_))
Output (the order may vary, because sets are unordered):
['website.com', 'example.com']
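One caveat with the netloc-or-path fallback: for a scheme-less entry that also has a path, such as example.com/page, urlparse puts the whole string in .path. A hedged sketch of one way around that is to prepend // when no scheme is present, so the host lands in the network-location part, and then read .hostname, which is lower-cased and has any port removed. The helper name hostname_of and the extra sample URLs are assumptions for illustration, not part of the question.

```python
from urllib.parse import urlparse

def hostname_of(url):
    """Return the host of url without a leading 'www.' (empty string if none)."""
    # Assumption: an entry without '://' is a bare host, optionally with a path.
    if '://' not in url:
        url = '//' + url  # makes urlparse treat the start as the network location
    host = urlparse(url).hostname or ''
    return host[4:] if host.startswith('www.') else host

samples = ['website.com', 'example.com/page', 'HTTP://WWW.Example.com:8080/a?b=1']
print(sorted({hostname_of(u) for u in samples}))  # ['example.com', 'website.com']
```

Sorting the set before printing gives a deterministic order.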
CodePudding user response:
Simple Python that handles your example URLs (but not more complex ones).
Code
def base_urls(lst):
    """Return the unique base URLs from a list of URLs."""
    def base_url(target):
        """Return the base URL of a single URL."""
        # Order matters: the scheme is stripped before 'www.'
        prefixes = ['https://', 'http://', 'www.']
        # Remove prefixes from the URL
        for prefix in prefixes:
            if target.startswith(prefix):
                target = target[len(prefix):]  # remove prefix by skipping it
        return target
    # Find the unique base URLs by converting to a set,
    # then convert the set back to a list
    return list(set(base_url(target) for target in lst))
Test
yourlist = ['example.com',
            'www.example.com',
            'http://example.com',
            'http://www.example.com',
            'https://example.com',  # correction of OP typo
            'www.website.com'       # added to OP list
            ]
print(base_urls(yourlist))
Output
['website.com', 'example.com']