I have the following code:
from bs4 import BeautifulSoup
import requests

root = 'https://br.investing.com'
website = f'{root}/news/latest-news'
result = requests.get(website, headers={"User-Agent": "Mozilla/5.0"})
content = result.text
soup = BeautifulSoup(content, 'lxml')
box = soup.find('section', id='leftColumn')
links = [link['href'] for link in box.find_all('a', href=True)]

for link in links:
    result = requests.get(f'{root}/{link}', headers={"User-Agent": "Mozilla/5.0"})
    content = result.text
    soup = BeautifulSoup(content, 'lxml')
    box = soup.find('section', id='leftColumn')
    title = box.find('h1').get_text()
    with open('headlines.txt', 'w') as file:
        file.write(title)
With this code I intend to scrape the URLs of news articles from a website, visit each of these URLs, get its header and write the headers to a text file. Instead, I'm getting just one header in the file and receiving AttributeError: 'NoneType' object has no attribute 'find'. What can be done about this?
CodePudding user response:
In your for loop, at the line title = box.find('h1').get_text(), box is None (i.e. NoneType), which is why you're being told that a 'NoneType' object has no attribute 'find'.
This is probably happening because at some point in the loop, the line box = soup.find('section', id='leftColumn') returns None. soup.find returns None when the page contains no matching element.
If box is None, the next line raises the error.
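You can fix this by checking that box is not None before calling find. So this:
box = soup.find('section', id='leftColumn')
title = box.find('h1').get_text()
will change to
box = soup.find('section', id='leftColumn')
if box is not None:
    title = box.find('h1').get_text()
Move the code that writes title under the same if as well. Otherwise a page without the section reuses the title from the previous iteration, or raises NameError if it is the first page.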
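A minimal sketch of the guarded lookup, run on inline HTML so it can be tried without network access. The helper name extract_title and the sample markup are hypothetical, and 'html.parser' is used here only to avoid depending on lxml. The sketch also guards the h1 lookup, since box.find('h1') can return None for the same reason.

```python
from bs4 import BeautifulSoup


def extract_title(html):
    """Return the <h1> text inside section#leftColumn, or None if either is missing."""
    soup = BeautifulSoup(html, 'html.parser')
    box = soup.find('section', id='leftColumn')
    if box is None:
        # Page has no matching section, so there is nothing to search in.
        return None
    heading = box.find('h1')
    if heading is None:
        # Section exists but has no <h1>.
        return None
    return heading.get_text(strip=True)


# Hypothetical sample pages: one article-like, one without the section.
article = '<section id="leftColumn"><h1> Some headline </h1></section>'
other = '<div><h1>No matching section here</h1></div>'

print(extract_title(article))
print(extract_title(other))
```

In the loop, a None result would simply be skipped (for example with continue) instead of being written to the file.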
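EDIT:
The reason why you're seeing only one header is the 'w' mode here: with open('headlines.txt', 'w'). Mode 'w' truncates the file each time it is opened, so every iteration overwrites what the previous one wrote. I can't check the page contents, but I would guess the header in your file is the last one written before the error.
To fix it, replace 'w' with 'a' (append), which adds title to the end of the existing file content. Write a newline after each title, e.g. file.write(title + '\n'), so the headers don't run together. You can read about the file modes here: https://www.w3schools.com/python/python_file_write.asp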
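To see the difference between the two modes, here is a small sketch that writes the same made-up titles in a loop, once with 'w' and once with 'a'. The file names and the titles list are placeholders, and a temporary directory is used so nothing is left behind.

```python
import os
import tempfile

titles = ['First header', 'Second header', 'Third header']  # made-up sample data

with tempfile.TemporaryDirectory() as tmp:
    overwritten = os.path.join(tmp, 'overwritten.txt')
    appended = os.path.join(tmp, 'appended.txt')

    for title in titles:
        # 'w' truncates the file on every open, so only the last write survives.
        with open(overwritten, 'w') as file:
            file.write(title + '\n')
        # 'a' keeps the existing content and adds to the end.
        with open(appended, 'a') as file:
            file.write(title + '\n')

    with open(overwritten) as file:
        overwritten_lines = file.read().splitlines()
    with open(appended) as file:
        appended_lines = file.read().splitlines()

print(overwritten_lines)
print(appended_lines)
```

Note that 'a' also keeps whatever earlier runs of the script wrote. If you want a fresh file on each run, an alternative is to open the file once with 'w' before the for loop and do all the writes inside that with block.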