I have a list of 5 URLs in a .txt file named URLlist.txt:
https://www.w3schools.com/php/php_syntax.asp
https://www.w3schools.com/php/php_comments.asp
https://www.w3schools.com/php/php_variables.asp
https://www.w3schools.com/php/php_echo_print.asp
https://www.w3schools.com/php/php_datatypes.asp
I need to parse the HTML content of all 5 URLs one by one for further processing.
My current code to parse an individual URL:
import requests
from bs4 import BeautifulSoup as bs  # HTML parsing using BeautifulSoup
r = requests.get("https://www.w3schools.com/whatis/whatis_jquery.asp")
soup = bs(r.content)
print(soup.prettify())
CodePudding user response:
Create a list of your links:
with open('URLlist.txt', 'r') as f:
    urls = [line.strip() for line in f]
Then you can loop over the URLs and parse each one:
for url in urls:
    r = requests.get(url)
    ...
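Putting the two pieces together with the parsing from your question (a minimal sketch, assuming the file is URLlist.txt and that the built-in html.parser backend is acceptable):
import requests
from bs4 import BeautifulSoup as bs

with open('URLlist.txt', 'r') as f:
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    r = requests.get(url)
    soup = bs(r.content, 'html.parser')  # parse the downloaded HTML
    print(soup.prettify())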
CodePudding user response:
You can solve this by reading the file line by line and passing each line into your request. Sample:
import requests
from bs4 import BeautifulSoup as bs  # HTML parsing using BeautifulSoup
f = open("URLlist.txt", "r")
for line in f:
print(line) # CURRENT LINE
r = requests.get(line)
soup = bs(r.content)
print(soup.prettify())
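Since all five pages come from the same host, a requests.Session can reuse the underlying connection across requests; this is an optional refinement, not something the snippet above requires:
import requests
from bs4 import BeautifulSoup as bs

with requests.Session() as session, open("URLlist.txt") as f:
    for line in f:
        url = line.strip()
        if not url:
            continue  # skip blank lines
        r = session.get(url)
        soup = bs(r.content, "html.parser")
        print(url, "->", len(soup.find_all("a")), "links")  # illustrative further processing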
CodePudding user response:
The way you implement this depends on whether you need to process the URLs iteratively or whether it's better to gather all the content for subsequent processing. I suggest the latter: build a dictionary where each key is a URL and the associated value is the text (HTML) returned from the page. Use multithreading for greater efficiency.
import requests
from concurrent.futures import ThreadPoolExecutor
data = dict()
def readurl(url):
    # fetch one URL and store its HTML keyed by URL; ignore failures
    try:
        (r := requests.get(url)).raise_for_status()
        data[url] = r.text
    except Exception:
        pass

def main():
    with open('URLlist.txt') as infile:
        with ThreadPoolExecutor() as executor:
            executor.map(readurl, map(str.strip, infile.readlines()))
    print(data)

if __name__ == '__main__':
    main()
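Once main() has run and data is populated, the further processing can then work entirely from memory, for example (an illustrative sketch, not part of the answer above):
from bs4 import BeautifulSoup as bs

for url, html in data.items():
    soup = bs(html, "html.parser")
    title = soup.title.string if soup.title else "(no title)"
    print(url, "->", title)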