Python - Loop through URLs, finding text, writing to new list


I'm trying to loop through a list of URLs, find something in each page's HTML, and write it to a new list. The issue I have is that although I have a for loop, it outputs only the last URL (there are 500 in the list "urls"). I don't know how to make it append on each iteration and then move on to the next one, instead of iterating through everything and then only writing the last one in the list. Any ideas on how to make that work?

for url in urls:
    try:
        page = urlopen(url)
    except:
        print("Error opening the URL")
    soup = BeautifulSoup(page, 'html.parser')
    content = soup.find('div', {"class": "sp-m-box-section"})
    article = []

    for url in urls:
        article = article.append(content)   # here I am completely unsure how to handle it
print(article)

Thanks for any ideas.

CodePudding user response:

Few issues here.

  1. You overwrite your article list on each iteration by re-declaring article = []. So it is always an empty list when you append to it; after the last iteration it isn't reset again, leaving you with only the last thing appended.
  2. Why iterate through the urls twice?
  3. I changed it to handle the try/except differently.
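A side note on point 1: list.append also mutates the list in place and returns None, so the question's "article = article.append(content)" rebinds article to None on top of the reset problem. A minimal demo:

```python
# list.append mutates the list in place and returns None,
# so assigning its result throws the list away
article = []
result = article.append("a")
print(result)   # None
print(article)  # ["a"]

# re-initializing the list inside the loop discards earlier items
for item in ["a", "b", "c"]:
    collected = []        # reset on every iteration!
    collected.append(item)
print(collected)  # only the last item survives: ["c"]
```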

Basically, try to read the page. If that fails, the error is caught and we continue to the next url (there's no sense in processing the html if it can't be read... plus you'd get an error there as well).
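The same try/except/continue pattern, illustrated with int() parsing standing in for urlopen (the values here are made up for the demo):

```python
# continue skips the rest of the loop body for a bad item,
# so the later lines never run on a value that failed to load
results = []
for value in ["1", "oops", "3"]:
    try:
        number = int(value)
    except ValueError:
        print("Error parsing", value)
        continue          # jump straight to the next value
    results.append(number * 10)

print(results)  # [10, 30]
```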

Give this a try:

article = []
for url in urls:
    try:
         page = urlopen(url)
    except:
        print("Error opening the URL") 
        continue
    soup = BeautifulSoup(page, 'html.parser')
    content = soup.find('div', {"class": "sp-m-box-section"})
    article.append(content.text) # <- here I'm assuming you want the actual text/content, not the html

print(article)

CodePudding user response:

Does this solve your problem?

article = []

for url in urls:
    try:
         page = urlopen(url)
    except:
        print("Error opening the URL")
        continue
    soup = BeautifulSoup(page, 'html.parser')
    content = soup.find('div', {"class": "sp-m-box-section"})       
    article.append(content)

print(article)
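One edge case both answers share: if a page doesn't contain the target div, soup.find returns None, and calling .text on it (or appending None to the list) goes wrong. A sketch of the guard, using inline HTML strings in place of fetched pages (requires beautifulsoup4):

```python
from bs4 import BeautifulSoup

# stand-ins for downloaded pages: one has the div, one doesn't
pages = [
    '<div class="sp-m-box-section">first article</div>',
    '<p>no matching div here</p>',
]

article = []
for html in pages:
    soup = BeautifulSoup(html, 'html.parser')
    content = soup.find('div', {"class": "sp-m-box-section"})
    if content is None:   # the div is missing on this page, skip it
        continue
    article.append(content.text)

print(article)  # ['first article']
```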