How can I extract the href from this URL?

Time:07-16

I am working on a project that requires scraping news content (the whole article) from websites. The page I am scraping is https://tg24.sky.it/politica, and I got the headlines with the following code:

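import requests
from bs4 import BeautifulSoup as soup  # assumed import: the code calls soup(...) as the constructor

r = requests.get('https://tg24.sky.it/politica')
b = soup(r.content, 'lxml')

title = []

for c in b.findAll('h2', {'class': 'c-card__title'}):
    title.append(c.text.strip())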

Now I want to scrape the href and use it to access the whole article. I am having trouble extracting the href. On the website, the href is in:

<a  href="https://tg24.sky.it/politica/2022/07/15/crisi-governo-draghi-ultime-notizie">
        <article >
            <div >
                
                <h2 >Governo Draghi, le ultime notizie sulla crisi aperta da Conte e M5S</h2>

How can I extract the href? I tried:


for c in b.findAll('a',{'class':'c-card c-card--CA10-m c-card--CA15-t c-card--CA15-d c-card--media  c-card--base '}):
    links.append(c.a['href'])

but it does not work.

CodePudding user response:

I would change two things:

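  1. Get the href directly: replace c.a['href'] with c['href']. The loop variable c is already the <a> tag, so c.a looks for a nested <a> inside it and returns None.
  2. Specify multiple classes inside a list: soup.findAll("a", {"class": ["c-card", "c-card--CA10-m"]}). A list matches tags that carry any of the listed classes, whereas a single space-separated string has to match the class attribute value exactly.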

In code:

import requests
from bs4 import BeautifulSoup

URL = "https://tg24.sky.it/politica"

response = requests.get(URL)
soup = BeautifulSoup(response.text, "lxml")

links = []
for link in soup.findAll("a", {"class": ["c-card", "c-card--CA10-m"]}):
    links.append(link["href"])

print(links)

NOTE: I haven't added all the classes, just enough to prove my point.
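
To see both points without hitting the live site, here is a minimal sketch that runs against a hard-coded snippet. The HTML string and the example.com URLs are assumptions loosely modelled on the markup and class names quoted in the question; the real page may differ. It uses Python's built-in "html.parser" so lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet modelled on the question's markup; not the real page.
HTML = """
<a class="c-card c-card--CA10-m c-card--base" href="https://example.com/one">
  <article><h2 class="c-card__title">First headline</h2></article>
</a>
<a class="c-card c-card--CA15-t" href="https://example.com/two">
  <article><h2 class="c-card__title">Second headline</h2></article>
</a>
<a class="footer-link" href="https://example.com/about">About</a>
"""

soup = BeautifulSoup(HTML, "html.parser")

# A list of classes matches any tag carrying at least one of them.
cards = soup.find_all("a", {"class": ["c-card", "c-card--CA10-m"]})

# The matched tag is already the <a>; it has no nested <a>, so .a is None.
nested = cards[0].a

# Read the attribute from the matched tag itself.
links = [card["href"] for card in cards]
print(links)
```

Both card links are collected and the footer link is skipped, while nested is None, which is why c.a['href'] raised an error in the question.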

CodePudding user response:

It seems you want to work with separate lists, but you should avoid this: if some piece of information is missing for an item, the lists end up with different lengths. Instead, scrape all the information in one pass and store it in a more structured way.

To get the links you could select them directly:

soup.find_all('a', {'class': 'c-card'})

or, if you start from your title elements:

data = []
for e in soup.find_all('h2', {'class': 'c-card__title'}):
    data.append({
        'title': e.get_text(strip=True),
        'url':e.find_previous('a').get('href')
    })

data would look like:

[{'title': 'Letta: "Draghi continui". Salvini: minacce le lascio a signori del No', 'url': 'https://tg24.sky.it/politica/2022/07/16/crisi-governo-draghi'}, {'title': 'Governo Draghi: chi sono i ministri, vice e sottosegretari del M5S', 'url': 'https://tg24.sky.it/politica/2022/07/16/ministri-movimento-5-stelle-governo-draghi'}, {'title': 'Crisi di governo, cosa ha portato Draghi a volersi dimettere', 'url': 'https://tg24.sky.it/politica/2022/07/16/crisi-governo-draghi-perche'}, {'title': 'Crisi governo: se si andasse al voto ora come finirebbe? LE GRAFICHE', 'url': 'https://tg24.sky.it/politica/2022/07/16/crisi-governo-simulazioni-risultati-elezioni'}, {'title': 'Governo Draghi, le ultime notizie sulla crisi aperta da Conte e M5S', 'url': 'https://tg24.sky.it/politica/2022/07/15/crisi-governo-draghi-ultime-notizie'}, {'title': 'Governo in bilico, Draghi deciso a lasciare e partiti nel caos', 'url': 'https://tg24.sky.it/politica/2022/07/15/crisi-governo-draghi'}, {'title': 'Patuanelli: "Ritiro ministri M5S? Dimissionario è il governo"', 'url': 'https://tg24.sky.it/politica/2022/07/15/crisi-governo-ritiro-ministri-m5s-patuanelli'},...]
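
One caveat worth checking on your own data: find_previous() scans backwards in document order, so a headline that is not wrapped in a link would pick up the link of an earlier card. The sketch below uses a hypothetical hard-coded snippet (not the real page) and swaps in find_parent(), which only walks up the tree, to make that case explicit. Whether the live page contains such unlinked headlines is an assumption, not something the question confirms.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet; the second headline deliberately has no enclosing link.
HTML = """
<div>
<a href="https://example.com/one">
  <article><h2 class="c-card__title">First headline</h2></article>
</a>
</div>
<section><h2 class="c-card__title">Unlinked headline</h2></section>
"""

soup = BeautifulSoup(HTML, "html.parser")

data = []
for h2 in soup.find_all("h2", {"class": "c-card__title"}):
    # find_parent only walks up the tree, so a headline without an
    # enclosing <a> yields None instead of borrowing an earlier link.
    parent_link = h2.find_parent("a")
    data.append({
        "title": h2.get_text(strip=True),
        "url": parent_link.get("href") if parent_link else None,
    })

# For comparison: find_previous returns the nearest earlier <a> in the document.
previous_link = soup.find_all("h2")[1].find_previous("a")
print(data)
print(previous_link.get("href"))
```

Here the unlinked headline gets url None with find_parent(), while find_previous() hands it the first card's link. Keeping title and url in one dict per item also means a missing value stays attached to its own record.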

Note: In new code, avoid the old findAll() syntax and use find_all() instead. For more, take a minute to check the docs.

Example

import requests
from bs4 import BeautifulSoup

response = requests.get('https://tg24.sky.it/politica')
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(response.text, 'lxml')

data = []

for a in soup.find_all('a', {'class': 'c-card'}):
    # Fetch each linked article page and parse it separately
    response = requests.get(a.get('href'))
    asoup = BeautifulSoup(response.text, 'lxml')
    data.append({
        'url': a.get('href'),
        'title': a.h2.get_text(strip=True),
        'content': asoup.article.get_text(strip=True)
    })

print(data)
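
The example above assumes every card has an <h2> and every article page has an <article> tag; if either is missing, a.h2 or asoup.article is None and the loop stops with an AttributeError. A more defensive variant could look like the sketch below. The helper names parse_cards and extract_article_text, and the sample HTML strings, are hypothetical and only stand in for downloaded pages, so the parsing can be tried without network access.

```python
from bs4 import BeautifulSoup


def parse_cards(listing_html):
    """Return one dict per card link found in the listing page HTML."""
    soup = BeautifulSoup(listing_html, "html.parser")
    cards = []
    for a in soup.find_all("a", {"class": "c-card"}):
        href = a.get("href")
        if not href:
            continue  # skip anchors without a target
        cards.append({
            "url": href,
            "title": a.h2.get_text(strip=True) if a.h2 else None,
        })
    return cards


def extract_article_text(article_html):
    """Return the text of the first <article>, or None if there is none."""
    soup = BeautifulSoup(article_html, "html.parser")
    article = soup.article
    return article.get_text(" ", strip=True) if article else None


# Hypothetical inputs standing in for downloaded pages.
LISTING = (
    '<a class="c-card" href="https://example.com/one"><h2>Headline</h2></a>'
    '<a class="c-card" href="https://example.com/two"></a>'
)
ARTICLE = "<html><body><article><p>First.</p><p>Second.</p></article></body></html>"
NO_ARTICLE = "<html><body><p>Nothing here.</p></body></html>"

cards = parse_cards(LISTING)
text = extract_article_text(ARTICLE)
missing = extract_article_text(NO_ARTICLE)
print(cards, text, missing)
```

In the live loop you would pass requests.get(url).text into these functions and store the result of extract_article_text under the 'content' key, deciding for yourself whether a None value should be kept or skipped.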