Removing specific <h2 class> from beautifulsoup4 web crawling results

Time:12-20

I am currently trying to crawl headlines of the news articles from https://7news.com.au/news/coronavirus-sa.

After finding that all the headlines are inside <h2> tags, I wrote the following code:

import requests
from bs4 import BeautifulSoup


url = 'https://7news.com.au/news/coronavirus-sa'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
titles = soup.find('body').find_all('h2')

for i in titles:
    print(i.text.strip())

The result of this code was:

News
Discover
Connect
SA COVID cases surge into triple digit figures for first time
Massive headaches at South Australian testing clinics as COVID cases surge
Revellers forced into isolation after SA teen goes clubbing while infectious with COVID
COVID scare hits Ashes Test in Adelaide after two media members test positive
SA to ease restrictions despite record number of COVID cases
‘We’re going to have cases every day’: SA records biggest COVID spike in 18 MONTHS
Fears for Adelaide nursing homes after COVID infections creep detected
Families in pre-Christmas quarantine after COVID alert for Adelaide school
South Australia records a JUMP in new COVID-19 cases - including infections in children
‘LOCK IT IN’: Mark McGowan to reveal date of WA’s long-awaited reopening to Australia
BOOSTER BOOST-UP: Australia makes change to COVID-19 vaccinations amid Omicron concern
Frydenberg calls for Aussies to ‘keep calm and carry on’ in the face of COVID-19 Omicron strain
News Just In
Our Network
Our Partners
Connect with 7NEWS

which contains unnecessary text such as 'News', 'Discover', and 'News Just In'.

This happened because those strings are inside <h2> tags as well. So I added the following code to delete them from the results:

soup.find('h2', id='css-1oh2gv-StyledHeading.e1fp214b7').decompose()

which raises an AttributeError:

AttributeError: 'NoneType' object has no attribute 'decompose'

I tried the clear() method as well, but it did not give the result I wanted.

Is there another way to remove the unnecessary text?

CodePudding user response:

What happens?

Your selection is too general: it matches every <h2> on the page. You do not need .decompose() to fix the issue.
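
As for the AttributeError in the question: find() returns None when nothing matches, and calling .decompose() on None fails. The value passed as id there looks like a class name rather than an id, which would explain the empty result. A minimal sketch of that behaviour, assuming made-up markup (the HTML and the class name css-abc-StyledHeading are hypothetical, not taken from the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: these class names are illustrative only.
html = """
<body>
  <h2 class="css-abc-StyledHeading">News</h2>
  <h2 class="Card-Headline">SA COVID cases surge</h2>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

# Searching by id finds nothing, because the value is a class name, not an id.
by_id = soup.find("h2", id="css-abc-StyledHeading")
print(by_id)  # None

# Searching by class finds the tag; guard against None before decomposing.
by_class = soup.find("h2", class_="css-abc-StyledHeading")
if by_class is not None:
    by_class.decompose()

# Collect the text of the <h2> tags that are still in the tree.
remaining = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]
print(remaining)
```

Checking the result of find() for None before calling a method on it avoids the error even when the page layout changes.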

How to fix?

Select the headlines more specifically:

soup.select('h2.Card-Headline')

Example

import requests
from bs4 import BeautifulSoup


url = 'https://7news.com.au/news/coronavirus-sa'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
for h2 in soup.select('h2.Card-Headline'):
    print(h2.text)

Output

SA COVID cases surge into triple digit figures for first time 
Massive headaches at South Australian testing clinics as COVID cases surge
Revellers forced into isolation after SA teen goes clubbing while infectious with COVID
COVID scare hits Ashes Test in Adelaide after two media members test positive
SA to ease restrictions despite record number of COVID cases
‘We’re going to have cases every day’: SA records biggest COVID spike in 18 MONTHS
Fears for Adelaide nursing homes after COVID infections creep detected
Families in pre-Christmas quarantine after COVID alert for Adelaide school
South Australia records a JUMP in new COVID-19 cases - including infections in children
‘LOCK IT IN’: Mark McGowan to reveal date of WA’s long-awaited reopening to Australia
BOOSTER BOOST-UP: Australia makes change to COVID-19 vaccinations amid Omicron concern
Frydenberg calls for Aussies to ‘keep calm and carry on’ in the face of COVID-19 Omicron strain
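
The same selection can be tried offline. The sketch below assumes invented markup (only the Card-Headline class name comes from the selector above) and shows that the CSS selector matches any <h2> whose class list contains Card-Headline, even when other classes are present. find_all() with class_ should behave the same way here:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the Card-Headline class name comes from the answer.
html = """
<h2 class="css-abc-StyledHeading">News</h2>
<h2 class="Card-Headline css-xyz">First headline</h2>
<h2 class="Card-Headline">Second headline</h2>
<h2>Our Network</h2>
"""
soup = BeautifulSoup(html, "html.parser")

# CSS selector: <h2> tags whose class list includes Card-Headline.
via_select = [h2.get_text(strip=True) for h2 in soup.select("h2.Card-Headline")]

# Equivalent search with find_all(); class_ matches any one of a tag's classes.
via_find_all = [
    h2.get_text(strip=True)
    for h2 in soup.find_all("h2", class_="Card-Headline")
]

print(via_select)
print(via_find_all)
```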

In addition, to answer the question as asked

If you do want to use decompose(), make your selection more specific there as well. As mentioned, though, it is not necessary:

for i in titles:
    # .get() avoids a KeyError for <h2> tags that have no class attribute
    if 'Heading' in ' '.join(i.get('class', [])):
        i.decompose()
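
A self-contained sketch of the decompose() approach, runnable without network access. The markup and class names are assumptions for illustration only; the idea is to remove every <h2> whose class names contain 'Heading' and then query the tree again:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
html = """
<body>
  <h2 class="css-abc-StyledHeading">News</h2>
  <h2 class="Card-Headline">First headline</h2>
  <h2>Untitled block</h2>
  <h2 class="css-def-StyledHeading">Our Network</h2>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

for h2 in soup.find_all("h2"):
    # .get() returns [] for tags without a class attribute.
    classes = h2.get("class", [])
    if any("Heading" in name for name in classes):
        h2.decompose()  # remove the tag from the tree

# Query again: decomposed tags are no longer part of the tree.
kept = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]
print(kept)
```

Note that a list obtained before the loop (such as titles above) still holds references to the removed tags, so it is safer to call find_all() again afterwards.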