I am doing a project that requires web scraping news content (the whole article) from websites. The page I am scraping is https://tg24.sky.it/politica, and I got the headlines with the following code:
import requests
from bs4 import BeautifulSoup as soup

r = requests.get('https://tg24.sky.it/politica')
b = soup(r.content, 'lxml')
title = []
for c in b.findAll('h2', {'class': 'c-card__title'}):
    title.append(c.text.strip())
Now I want to scrape the href and, through it, access the whole content. I am having trouble extracting the href. On the website the href is in:
<a href="https://tg24.sky.it/politica/2022/07/15/crisi-governo-draghi-ultime-notizie">
<article >
<div >
<h2 >Governo Draghi, le ultime notizie sulla crisi aperta da Conte e M5S</h2>
How can I extract the href? I tried
for c in b.findAll('a', {'class': 'c-card c-card--CA10-m c-card--CA15-t c-card--CA15-d c-card--media c-card--base '}):
    links.append(c.a['href'])
but it does not work.
CodePudding user response:
I would change two things:
- Get the href directly: replace c.a['href'] with c['href']
- Specify multiple classes inside a list: soup.findAll("a", {"class": ["c-card", "c-card--CA10-m"]})
In code:
import requests
from bs4 import BeautifulSoup
URL = "https://tg24.sky.it/politica"
response = requests.get(URL)
soup = BeautifulSoup(response.text, "lxml")
links = []
for link in soup.findAll("a", {"class": ["c-card", "c-card--CA10-m"]}):
    links.append(link["href"])
print(links)
NOTE: I haven't added all the classes, just enough to prove my point.
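To then follow one of those links and pull the full article text (which is what you ultimately want), here is a minimal sketch; the assumption that the article body sits inside an <article> tag is mine and worth checking against the page source:
article_url = links[0]  # one of the hrefs collected above
article_response = requests.get(article_url)
article_soup = BeautifulSoup(article_response.text, "lxml")
article = article_soup.find("article")  # assumed container for the article body
if article is not None:
    print(article.get_text(strip=True))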
CodePudding user response:
It seems that you want to work with separate lists, but you should avoid this: if some information is not available, the lists end up with different lengths. So try to scrape all the information in one go and store it in a more structured way.
To get the links you could select them directly:
soup.find_all('a', {'class': 'c-card'})
or if you come from your title:
data = []
for e in soup.find_all('h2', {'class': 'c-card__title'}):
    data.append({
        'title': e.get_text(strip=True),
        'url': e.find_previous('a').get('href')
    })
data would look like:
[{'title': 'Letta: "Draghi continui". Salvini: minacce le lascio a signori del No', 'url': 'https://tg24.sky.it/politica/2022/07/16/crisi-governo-draghi'},
 {'title': 'Governo Draghi: chi sono i ministri, vice e sottosegretari del M5S', 'url': 'https://tg24.sky.it/politica/2022/07/16/ministri-movimento-5-stelle-governo-draghi'},
 {'title': 'Crisi di governo, cosa ha portato Draghi a volersi dimettere', 'url': 'https://tg24.sky.it/politica/2022/07/16/crisi-governo-draghi-perche'},
 {'title': 'Crisi governo: se si andasse al voto ora come finirebbe? LE GRAFICHE', 'url': 'https://tg24.sky.it/politica/2022/07/16/crisi-governo-simulazioni-risultati-elezioni'},
 {'title': 'Governo Draghi, le ultime notizie sulla crisi aperta da Conte e M5S', 'url': 'https://tg24.sky.it/politica/2022/07/15/crisi-governo-draghi-ultime-notizie'},
 {'title': 'Governo in bilico, Draghi deciso a lasciare e partiti nel caos', 'url': 'https://tg24.sky.it/politica/2022/07/15/crisi-governo-draghi'},
 {'title': 'Patuanelli: "Ritiro ministri M5S? Dimissionario è il governo"', 'url': 'https://tg24.sky.it/politica/2022/07/15/crisi-governo-ritiro-ministri-m5s-patuanelli'},
 ...]
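If you want to persist that structure, one possible way is to write the list of dicts to a CSV file; a minimal sketch using only the standard library (the filename is just an example):
import csv

with open('articles.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'url'])
    writer.writeheader()
    writer.writerows(data)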
Note: In newer code avoid the old syntax findAll(); use find_all() instead. For more, take a minute to check the docs.
Example
import requests
from bs4 import BeautifulSoup

response = requests.get('https://tg24.sky.it/politica')
soup = BeautifulSoup(response.text)

data = []
for a in soup.find_all('a', {'class': 'c-card'}):
    # request each linked article page and parse it
    response = requests.get(a.get('href'))
    asoup = BeautifulSoup(response.text)
    data.append({
        'url': a.get('href'),
        'title': a.h2.get_text(strip=True),
        'content': asoup.article.get_text(strip=True)
    })
print(data)
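Not every c-card anchor is guaranteed to contain an <h2>, and not every linked page necessarily has an <article> tag, so a single missing element would raise an error in the loop above. Under that assumption, a defensive variant of the loop could look like:
for a in soup.find_all('a', {'class': 'c-card'}):
    response = requests.get(a.get('href'))
    asoup = BeautifulSoup(response.text)
    h2 = a.find('h2')  # may be missing on some cards
    article = asoup.find('article')  # may be missing on some pages
    data.append({
        'url': a.get('href'),
        'title': h2.get_text(strip=True) if h2 else None,
        'content': article.get_text(strip=True) if article else None
    })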