Home > Net >  How to HTML parse a URL list using python
How to HTML parse a URL list using python

Time:03-22

I have a URL list of 5 URLs within a .txt file named as URLlist.txt.

https://www.w3schools.com/php/php_syntax.asp
https://www.w3schools.com/php/php_comments.asp
https://www.w3schools.com/php/php_variables.asp
https://www.w3schools.com/php/php_echo_print.asp
https://www.w3schools.com/php/php_datatypes.asp

I need to parse all the HTML content within the 5 URLs one by one for further processing.

My current code to parse an individual URL -

import requests from bs4 
import BeautifulSoup as bs   #HTML parsing using beatuifulsoup

r = requests.get("https://www.w3schools.com/whatis/whatis_jquery.asp")
soup = bs(r.content)   
print(soup.prettify())

CodePudding user response:

Create a list of your links

with open('test.txt', 'r') as f:
    urls = [line.strip() for line in f]

Then u can loop your parse

for url in urls:
    r = requests.get(url)
    ...

CodePudding user response:

Your problem will be solved using line-by-line readying and then put that line in your request. sample:

import requests from bs4
import BeautifulSoup as bs   #HTML parsing using beatuifulsoup

f = open("URLlist.txt", "r")
for line in f:
    print(line) # CURRENT LINE
    r = requests.get(line)
    soup = bs(r.content)
    print(soup.prettify())

CodePudding user response:

The way you implement this rather depends on whether you need to process the URLs iteratively or whether it's better to gather all the content for subsequent processing. That's what I suggest. Build a dictionary where each key is a URL and the associated value is the text (HTML) return from the page. Use multithreading for greater efficiency.

import requests
from concurrent.futures import ThreadPoolExecutor

data = dict()

def readurl(url):
    try:
        (r := requests.get(url)).raise_for_status()
        data[url] = r.text
    except Exception:
        pass

def main():
    with open('urls.txt') as infile:
        with ThreadPoolExecutor() as executor:
            executor.map(readurl, map(str.strip, infile.readlines()))
    print(data)

if __name__ == '__main__':
    main()
  • Related