I am new to this and am trying to get BeautifulSoup to work. I am having trouble picking the right HTML tags and classes. I am getting closer, but something is still wrong: I am passing the wrong tags and classes to scrape the title, time, link, and text of a news item.
I would like to scrape all the titles in the vertical list, and then the date, title, link, and content of each one.
Can you help me with the right HTML tags and classes, please?
I'm not getting any errors, but the Python console stays empty:
>>>
Code
import requests
from bs4 import BeautifulSoup
site = requests.get('url')
beautify = BeautifulSoup(site.content,'html5lib')
news = beautify.find_all('div', {'class','$00'})
arti = []
for each in news:
    time = each.find('span', {'class','hh serif'}).text
    title = each.find('span', {'class','title'}).text
    link = each.a.get('href')
    r = requests.get(url)
    soup = BeautifulSoup(r.text,'html5lib')
    content = soup.find('div', class_ = "read__content").text.strip()
    print(" ")
    print(time)
    print(title)
    print(link)
    print(" ")
    print(content)
    print(" ")
CodePudding user response:
Here is a solution you can try:
import requests
from bs4 import BeautifulSoup

# Send a browser-like User-Agent so the request looks like a normal browser visit
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
}
site = requests.get('https://www.tuttomercatoweb.com/atalanta/', headers=headers)
soup = BeautifulSoup(site.content, 'html.parser')

# Each "tcc-list-news" container holds one <div> per news item
news = soup.find_all('div', attrs={"class": "tcc-list-news"})
for each in news:
    for div in each.find_all("div"):
        print("-- Time ", div.find('span', attrs={'class': 'hh serif'}).text)
        print("-- Href ", div.find("a")['href'])
        print("-- Text ", " ".join([span.text for span in div.select("a > span")]))
        print("-" * 30)  # separator between items, as in the output below
-- Time 11:36
-- Href https://www.tuttomercatoweb.com/atalanta/?action=read&idtmw=1661241
-- Text focus Serie A, punti nel 2022: Juve prima, ma un solo punto in più rispetto a Milan e Napoli
------------------------------
-- Time 11:24
-- Href https://www.tuttomercatoweb.com/atalanta/?action=read&idtmw=1661233
-- Text focus Serie A, chi più in forma? Le ultime 5 gare: Sassuolo e Juve in vetta, crisi Venezia
------------------------------
-- Time 11:15
-- Href https://www.tuttomercatoweb.com/atalanta/?action=read&idtmw=1661229
-- Text Le pagelle di Cissé: come nelle migliori favole. Dalla seconda categoria al gol in serie A
------------------------------
...
...
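If some rows in the list lack a time span or a link, the loop above would fail on `.text` or `['href']`. Below is a minimal sketch of a more defensive version of the same idea. It runs on an inline HTML string rather than the live site: `SAMPLE_HTML` and the `parse_news` helper are hypothetical, and the markup is only modelled on the class names used above, so the real page may well differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, modelled on the class names used in the answer above.
SAMPLE_HTML = """
<div class="tcc-list-news">
  <div>
    <span class="hh serif">11:36</span>
    <a href="https://example.com/news/1"><span>focus</span> <span>First headline</span></a>
  </div>
  <div>
    <a href="https://example.com/news/2"><span>Second headline</span></a>
  </div>
  <div>No link here</div>
</div>
"""


def parse_news(html):
    """Return a list of dicts with time, link and title for each news row."""
    soup = BeautifulSoup(html, "html.parser")
    items = []
    # Direct child <div> elements of the news container are treated as rows.
    for row in soup.select("div.tcc-list-news > div"):
        link_tag = row.find("a", href=True)
        if link_tag is None:
            continue  # skip rows that do not contain a link
        time_tag = row.find("span", class_="hh serif")
        items.append({
            "time": time_tag.get_text(strip=True) if time_tag else None,
            "link": link_tag["href"],
            "title": " ".join(
                span.get_text(strip=True) for span in link_tag.find_all("span")
            ),
        })
    return items


for item in parse_news(SAMPLE_HTML):
    print(item)
```

Collecting the rows into a list of dicts instead of printing them directly makes it easier to fetch each `link` afterwards and pull the article content from it.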
EDIT:
Why are headers required here? See "How to use Python requests to fake a browser visit a.k.a and generate User Agent?"
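As a rough illustration of what the header changes: by default, requests identifies itself with a User-Agent of the form `python-requests/<version>`, and some sites respond differently to that than to a browser. The sketch below prints the default value and then shows the value that would be sent once a browser-like User-Agent is set on a session. It only prepares the request and does not send it. The name `BROWSER_UA` is just an illustrative constant holding the string from the answer above.

```python
import requests

# Browser-like User-Agent string taken from the answer above.
BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/39.0.2171.95 Safari/537.36"
)

# What requests sends when no User-Agent header is supplied.
default_ua = requests.utils.default_user_agent()
print("default:", default_ua)

# Setting the header on a Session applies it to every request made through it.
session = requests.Session()
session.headers.update({"User-Agent": BROWSER_UA})

# Prepare (but do not send) a request to inspect the headers that would go out.
request = requests.Request("GET", "https://www.tuttomercatoweb.com/atalanta/")
prepared = session.prepare_request(request)
print("sent:   ", prepared.headers["User-Agent"])
```

Using a `Session` is a design choice for the case where many article links are fetched in a loop: the header is set once and the connection can be reused.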