I have a list of URLs stored in a text file called 'urls.txt', and I want to remove the prefix of each link so that only the domain name is left. Example: https://www.stackoverflow.com becomes stackoverflow.com, etc.
Problem: my code removes the prefix of only the first URL and leaves the rest of the list as it was.
My Python code:
filename = 'urls.txt'
prefix = 'htps:/w.'
url_list = open(filename)
links = url_list.read()
single_url = links.strip('/n') # remove the /n at the end of each url
domain = single_url.lstrip(prefix)
print(domain)
What should I do here?
CodePudding user response:
urllib is your friend here:
from urllib.parse import urlparse
with open('urls.txt') as urls:
    for line in map(str.strip, urls):
        print(urlparse(line).netloc)
Bear in mind that the domain won't necessarily start with 'www', but you can check for that and handle it separately if necessary.
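One way that separate handling might look is sketched below. The helper name bare_domain is hypothetical (it is not part of urllib), and the sketch assumes every line is either a full URL or a bare host such as www.website.com. Note that urlparse leaves netloc empty when the input has no scheme, so the sketch prepends // in that case.

```python
from urllib.parse import urlparse


def bare_domain(url):
    """Return the host part of url without a leading 'www.' (hypothetical helper)."""
    url = url.strip()
    # urlparse only fills netloc when a scheme or a leading '//' is present
    if "://" not in url:
        url = "//" + url
    host = urlparse(url).netloc
    # drop the 'www.' prefix only when it is really there
    if host.startswith("www."):
        host = host[len("www."):]
    return host


print(bare_domain("https://www.stackoverflow.com"))
```

Calling bare_domain on each stripped line of the file would then replace the print(urlparse(line).netloc) call above.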
CodePudding user response:
A few problems with your code:
- links is a string, not a list. If your file contains one URL per line and you want a list, read the file as a list of lines using readlines(), or iterate over the file handle url_list.
- The newline character is \n, not /n (note the direction of the slash).
- Reconsider using lstrip(). Why? See what happens when you try "https://www.helloworld.org".lstrip(prefix). Instead, I suggest using re.sub().
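To make the lstrip() point concrete, here is a small sketch using the prefix string from the question. lstrip() treats its argument as a set of characters to remove, not as a literal prefix, so it keeps stripping as long as the next character is in that set.

```python
import re

prefix = 'htps:/w.'
url = "https://www.helloworld.org"

# lstrip removes any leading characters found in the set {h, t, p, s, :, /, w, .},
# so the 'h' of 'helloworld' is eaten as well
stripped = url.lstrip(prefix)
print(stripped)  # elloworld.org

# a regex anchored at the start removes only the actual prefix
cleaned = re.sub(r"^(https?://)?(www\.)?", "", url)
print(cleaned)  # helloworld.org
```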
So, with all these:
import re
file_name = 'urls.txt'
urls = []
with open(file_name) as url_file:
    for line in url_file:
        # raw string so that \. reaches the regex engine unchanged
        url = re.sub(r"^(https?://)?(www\.)?", "", line.strip())
        urls.append(url)
Regex explanation:
^(https?://)?(www\.)?
^ : Match start of line
http : Literally http
s? : Zero or one s
:// : Literally ://
( )? : The group enclosed in parentheses is optional
www\. : www, then a literal period
( )? : The group enclosed in parentheses is optional
Anything that matches this regex is substituted with "".
With the input file
http://www.website.com
https://www.website.com
http://website.com
https://website.com
www.website.com
this code gives
urls = ['website.com', 'website.com', 'website.com', 'website.com', 'website.com']
All of this can be written as a list comprehension:
with open(file_name) as url_file:
    urls = [re.sub(r"^(https?://)?(www\.)?", "", line.strip()) for line in url_file]
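If you want to try this without creating a file, the following sketch feeds the sample input from above through the same comprehension. The io.StringIO object stands in for the opened file, and the names PREFIX and sample are only illustrative; compiling the pattern once is optional.

```python
import io
import re

# same pattern as above, compiled once (illustrative name)
PREFIX = re.compile(r"^(https?://)?(www\.)?")

# in-memory stand-in for the opened urls file
sample = io.StringIO(
    "http://www.website.com\n"
    "https://www.website.com\n"
    "http://website.com\n"
    "https://website.com\n"
    "www.website.com\n"
)

urls = [PREFIX.sub("", line.strip()) for line in sample]
print(urls)
```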