I have a list of 5 URLs in a .txt file named URLlist.txt:
https://www.w3schools.com/php/php_syntax.asp
https://www.w3schools.com/php/php_comments.asp
https://www.w3schools.com/php/php_variables.asp
https://www.w3schools.com/php/php_echo_print.asp
https://www.w3schools.com/php/php_datatypes.asp
I need to parse the HTML content of all 5 URLs one by one for further processing.
My current code to parse an individual URL:
import requests
from bs4 import BeautifulSoup as bs  # HTML parsing using BeautifulSoup
r = requests.get("https://www.w3schools.com/whatis/whatis_jquery.asp")
soup = bs(r.content)
print(soup.prettify())
CodePudding user response:
Create a list of your links:
with open('URLlist.txt', 'r') as f:
    urls = [line.strip() for line in f]
Then you can loop over the URLs and parse each one:
for url in urls:
    r = requests.get(url)
    ...
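Putting the two pieces together with the parsing from your question (a minimal sketch, assuming the file is URLlist.txt and that the built-in html.parser backend is acceptable):
import requests
from bs4 import BeautifulSoup as bs

with open('URLlist.txt', 'r') as f:
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    r = requests.get(url)
    soup = bs(r.content, 'html.parser')  # parse the downloaded HTML
    print(soup.prettify())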
CodePudding user response:
You can solve this by reading the file line by line and passing each line into your request. Sample:
import requests
from bs4 import BeautifulSoup as bs  # HTML parsing using BeautifulSoup
f = open("URLlist.txt", "r")
for line in f:
print(line) # CURRENT LINE
r = requests.get(line)
soup = bs(r.content)
print(soup.prettify())
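Since all five pages come from the same host, a requests.Session can reuse the underlying connection across requests; this is an optional refinement, not something the snippet above requires:
import requests
from bs4 import BeautifulSoup as bs

with requests.Session() as session, open("URLlist.txt") as f:
    for line in f:
        url = line.strip()
        if not url:
            continue  # skip blank lines
        r = session.get(url)
        soup = bs(r.content, "html.parser")
        print(url, "->", len(soup.find_all("a")), "links")  # illustrative further processing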
CodePudding user response:
The way you implement this depends on whether you need to process the URLs iteratively or whether it's better to gather all the content for subsequent processing. I suggest the latter: build a dictionary where each key is a URL and the associated value is the text (HTML) returned from the page. Use multithreading for greater efficiency.
import requests
from concurrent.futures import ThreadPoolExecutor
data = dict()
def readurl(url):
    # fetch one URL and store its HTML keyed by URL; ignore failures
    try:
        (r := requests.get(url)).raise_for_status()
        data[url] = r.text
    except Exception:
        pass

def main():
    with open('URLlist.txt') as infile:
        with ThreadPoolExecutor() as executor:
            executor.map(readurl, map(str.strip, infile.readlines()))
    print(data)

if __name__ == '__main__':
    main()
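Once main() has run and data is populated, the further processing can then work entirely from memory, for example (an illustrative sketch, not part of the answer above):
from bs4 import BeautifulSoup as bs

for url, html in data.items():
    soup = bs(html, "html.parser")
    title = soup.title.string if soup.title else "(no title)"
    print(url, "->", title)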