Remove the prefix of each URL from a list of URLs in Python

I have a list of URLs stored in a text file called 'urls.txt', and I want to remove the prefix of each link so that only the domain name remains. For example, https://www.stackoverflow.com becomes stackoverflow.com.

Problem: My code removes the prefix of only the first URL and leaves the rest of the list unchanged.

My Python code:

filename = 'urls.txt'
prefix = 'htps:/w.'
url_list = open(filename)
links = url_list.read()

single_url = links.strip('/n')  # remove the /n at the end of each url
domain = single_url.lstrip(prefix)

print(domain)

What should I do here?

CodePudding user response：

urllib is your friend here:

from urllib.parse import urlparse

with open('urls.txt') as urls:
    for line in map(str.strip, urls):
        print(urlparse(line).netloc)

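Bear in mind that netloc keeps a leading 'www.' when the URL has one (https://www.stackoverflow.com gives www.stackoverflow.com), and that not every domain starts with 'www', so check for that prefix and strip it separately if necessary.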
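One way to handle the 'www.' case on top of urlparse is sketched below. The helper name domain_of is made up for illustration, and prepending "//" to scheme-less input is an assumption about how lines such as www.website.com should be treated; without it, urlparse puts that text in path and leaves netloc empty.

```python
from urllib.parse import urlparse


def domain_of(url):
    """Return the host part of url without a leading 'www.' (hypothetical helper)."""
    url = url.strip()
    # Assumption: a line with no scheme is still meant to be a host name.
    # urlparse only fills netloc when the text after the scheme starts with '//'.
    if "://" not in url:
        url = "//" + url
    host = urlparse(url).netloc
    # Remove the 'www.' prefix only when it is really there.
    if host.startswith("www."):
        host = host[len("www."):]
    return host


for sample in ("https://www.stackoverflow.com", "http://website.com", "www.website.com"):
    print(domain_of(sample))
```

Note that netloc still includes a port or user info if the URL carries one, so this sketch only covers plain URLs like the ones in the question.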

CodePudding user response：

A few problems with your code:

links is a string, not a list. If your file contains one URL per line and you want a list, read the file as a list of lines using readlines(), or iterate over the file handle url_list.
The newline character is \n, not /n (note the direction of the slash).
Reconsider using lstrip(). Why? It treats its argument as a set of characters to remove, not as a literal prefix. See what happens when you try "https://www.helloworld.org".lstrip(prefix). Instead, I suggest using re.sub().
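The lstrip() and strip() pitfalls above can be seen in a short sketch. The prefix value is the one from the question; remove_prefix is a hypothetical helper added only to contrast literal prefix removal with character-set stripping.

```python
prefix = 'htps:/w.'
url = "https://www.helloworld.org"

# lstrip() removes any leading characters found in the set {h, t, p, s, :, /, w, .},
# so it keeps going past "www." and also eats the "h" of "helloworld".
stripped = url.lstrip(prefix)
print(stripped)  # elloworld.org

# strip('/n') removes the characters '/' and 'n', not the newline '\n',
# so the trailing newline is still there.
line = "https://example.org\n"
print(repr(line.strip('/n')))  # 'https://example.org\n'


def remove_prefix(text, prefix_text):
    """Drop prefix_text from the start of text only if it is literally there."""
    if text.startswith(prefix_text):
        return text[len(prefix_text):]
    return text


fixed = remove_prefix(remove_prefix(url, "https://"), "www.")
print(fixed)  # helloworld.org
```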

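Putting these together:

import re

file_name = 'urls.txt'  # same file as in the question
urls = []

with open(file_name) as url_file:
    for line in url_file:  # iterate over the file one line at a time
        # strip the newline, then drop an optional scheme and an optional "www."
        url = re.sub(r"^(https?://)?(www\.)?", "", line.strip())
        urls.append(url)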


Regex explanation:

^(https?://)?(www\.)?

^                      : Match start of line
  http                 : Literally http
      s?               : Zero or one s
        ://            : Literally ://
 (         )?          : The group enclosed in parentheses is optional
              www\.    : www, then a literal period
             (     )?  : The group enclosed in parentheses is optional

Anything that matches this regex is substituted with "". With the input file

http://www.website.com
https://www.website.com
http://website.com
https://website.com
www.website.com

this code gives

urls =  ['website.com', 'website.com', 'website.com', 'website.com', 'website.com']
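The substitution can also be checked without a file. The sketch below assumes the same five sample lines held in a list (sample_lines and PREFIX_RE are illustrative names). It also shows that the pattern only touches the front of each string, so a URL with a path keeps that path.

```python
import re

# Same pattern as above, compiled once and written as a raw string.
PREFIX_RE = re.compile(r"^(https?://)?(www\.)?")

# Stand-in for the lines of the input file, newlines included.
sample_lines = [
    "http://www.website.com\n",
    "https://www.website.com\n",
    "http://website.com\n",
    "https://website.com\n",
    "www.website.com\n",
]

urls = [PREFIX_RE.sub("", line.strip()) for line in sample_lines]
print(urls)

# Only the prefix is removed; anything after the domain is left alone.
with_path = PREFIX_RE.sub("", "https://www.website.com/page")
print(with_path)  # website.com/page
```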

All of this can be written as a list comprehension:

with open(file_name) as url_file:
    urls = [re.sub(r"^(https?://)?(www\.)?", "", line.strip()) for line in url_file]