I'm trying to loop through a list of URLs, find something in each page's HTML, and write it to a new list. The problem is that although I have a for loop, the output contains only the last URL's content (there are 500 entries in the list "urls"). I don't know how to make it write on each iteration and then move on to the next one, instead of looping through everything and only writing the last one. Any ideas on how to make that work?
for url in urls:
    try:
        page = urlopen(url)
    except:
        print("Error opening the URL")

    soup = BeautifulSoup(page, 'html.parser')
    content = soup.find('div', {"class": "sp-m-box-section"})

    article = []
    for url in urls:
        article = article.append(content)  # here I am completely unsure how to handle it

print(article)
Thanks for any ideas.
CodePudding user response:
A few issues here:

- You overwrite your `article` list on every iteration by declaring `article = []` inside the loop, so it is reset to empty even after you append. After the last iteration that reset doesn't run again, which is why you're left with only the last thing appended. (Note also that `list.append` returns `None`, so `article = article.append(content)` throws the list away.)
- There's no need to iterate through the urls twice.
- I changed how the `try/except` is handled: try to read the page, and if that fails, print the error and `continue` to the next url. There's no sense in processing the HTML if the page can't be read (plus you'd get an error there as well).
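The reset bug is easy to reproduce without any web scraping, using plain strings in place of urls (the names `items` and `collected` here are just for illustration):

```python
items = ["a", "b", "c"]

# Buggy version: the list is re-created on every iteration,
# so each pass discards everything collected before it.
for item in items:
    collected = []          # <- reset happens inside the loop
    collected.append(item)
print(collected)            # ['c'] -- only the last item survives

# Fixed version: initialize the list once, before the loop.
collected = []
for item in items:
    collected.append(item)
print(collected)            # ['a', 'b', 'c']
```

Moving the initialization above the loop is the whole fix; the append itself was never the problem.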
Give this a try:
article = []

for url in urls:
    try:
        page = urlopen(url)
    except:
        print("Error opening the URL")
        continue

    soup = BeautifulSoup(page, 'html.parser')
    content = soup.find('div', {"class": "sp-m-box-section"})
    article.append(content.text)  # <- here I'm assuming you want the actual text/content, not the html

print(article)
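One more note on the original `article = article.append(content)` line: `list.append` mutates the list in place and returns `None`, so assigning its result back replaces your list with `None`. A quick demonstration (variable names are illustrative):

```python
article = []

# append modifies the list in place and returns None
result = article.append("x")
print(result)   # None
print(article)  # ['x'] -- the list itself was updated

# so the correct pattern is simply:
article.append("y")
print(article)  # ['x', 'y']
```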
CodePudding user response:
Does this solve your problem?
article = []

for url in urls:
    try:
        page = urlopen(url)
    except:
        print("Error opening the URL")
        continue

    soup = BeautifulSoup(page, 'html.parser')
    content = soup.find('div', {"class": "sp-m-box-section"})
    article.append(content)

print(article)