I want to extract some information from multiple pages that have similar structures. All of the page URLs are saved in one .txt file, one URL per line. I have already written code that scrapes all the data from a single link, and it works.
But I don't know how to write a loop that goes through the whole list of URLs in the .txt file and scrapes the data from each one.
from bs4 import BeautifulSoup
from urllib.request import urlopen
import requests
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
from bs4 import Comment
import re
import rispy # Writing an ris file
with open('F:\Python\Python-FilePy-Thesis-DownLoad/Thesis2.txt', 'r') as f:
    for line in f:
        url = line.strip()
html = requests.get(url).text
soup = BeautifulSoup(html, "html.parser")
CodePudding user response:
The problem is in this code:
with open('F:\Python\Python-FilePy-Thesis-DownLoad/Thesis2.txt', 'r') as f:
    for line in f:
        url = line.strip()
html = requests.get(url).text
soup = BeautifulSoup(html, "html.parser")
With this code, only the last URL from the TXT file is ever requested. The for loop reassigns the variable url on every line, so once the loop finishes, url holds the last line of the file, and html and soup are built from that URL alone.
The request and the parsing need to happen inside the loop, once per URL. The code should be:
# Raw string so the backslashes in the Windows path are not treated as escapes
with open(r'F:\Python\Python-FilePy-Thesis-DownLoad/Thesis2.txt', 'r') as f:
    for line in f:
        url = line.strip()
        html = requests.get(url).text
        soup = BeautifulSoup(html, "html.parser")
CodePudding user response:
Just work with each URL's page inside the loop!
for line in f:
    url = line.strip()
    html = requests.get(url).text  # is .content better?
    soup = BeautifulSoup(html, "html.parser")
    # work with soup here!
Creating more functions may make your program easier to read if you find yourself packing a lot into one block.
See cyclomatic complexity (which is practically a count of control statements like if and for).
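As a rough illustration of splitting the work into functions, here is a minimal sketch. The names read_urls and scrape_all are hypothetical (they are not from the question), and the download and extraction steps are passed in as the fetch and parse arguments so each piece stays small and can be tried without touching the network.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def read_urls(lines: Iterable[str]) -> Iterator[str]:
    """Yield one cleaned URL per non-blank line (hypothetical helper)."""
    for line in lines:
        url = line.strip()
        if url:  # skip empty lines, e.g. a blank line at the end of the file
            yield url


def scrape_all(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    parse: Callable[[str], T],
) -> List[T]:
    """Fetch and parse every URL, returning one parsed result per URL."""
    results = []
    for url in urls:
        html = fetch(url)            # download step
        results.append(parse(html))  # extraction step
    return results
```

With the libraries from the question, fetch might be something like lambda url: requests.get(url).text, and parse a function that builds BeautifulSoup(html, "html.parser") and pulls out the fields you need. An open file handle can be passed straight to read_urls, since iterating over it yields lines.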
Additionally, if you want to collect all the values before doing further processing, you might consider creating a collection before the loop to store the results. (This is frequently better accomplished with more esoteric logic, such as a generator, or asyncio to fetch many pages in parallel.)
collected_results = []  # a new list
...
for line in fh:
    result = ...  # process the line via whatever logic
    collected_results.append(result)
# now collected_results has the result from each line
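The generator and parallel options mentioned above could look roughly like the sketch below. The function names iter_pages and fetch_parallel are hypothetical, and fetch stands for whatever callable downloads a page. Note that the parallel version uses a thread pool from the standard library rather than asyncio, on the assumption that the download function is a blocking call such as requests.get.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, List, Tuple


def iter_pages(
    lines: Iterable[str], fetch: Callable[[str], str]
) -> Iterator[Tuple[str, str]]:
    """Lazily yield (url, html) pairs, fetching each page only when asked for."""
    for line in lines:
        url = line.strip()
        if not url:  # ignore blank lines
            continue
        yield url, fetch(url)


def fetch_parallel(
    urls: Iterable[str], fetch: Callable[[str], str], max_workers: int = 4
) -> List[Tuple[str, str]]:
    """Fetch several pages at once with threads; results keep the input order."""
    url_list = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pages = list(pool.map(fetch, url_list))  # map preserves input order
    return list(zip(url_list, pages))
```

The generator avoids holding every page in memory at once, while the thread pool trades memory for speed when the time is dominated by waiting on the network. Which one fits better depends on how many URLs are in the file and how the site reacts to concurrent requests.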