get text from a class of a class with beautifulsoup

Time:08-02

I am new to BeautifulSoup. I usually do web scraping with Scrapy, which uses response.xpath to get the text.

This time, I want to get the article title from a class called article-title and the published date from a class called meta-posted.

The HTML looks like this:

<div >
  <article >
    <header >
       <h1 class="article-title" style="font-size: 28px !important; font-family: sans-serif !important;">Presentation: Govt pushes CCS/CCUS development in RI upstream sector</h1>
       <div >
         <span class="meta-posted">
                    Monday, August 1 2022 - 04:27PM WIB </span>
       </div>

To get the title, I tried:

title= res.findAll('h1', attrs={'class':'article-title'})

but it still gives me:

[<h1  style="font-size: 28px !important; font-family: sans-serif !important;">Pertagas, Chandra Asri sign gas MoU</h1>]

To get the date, I tried:

date = res.findAll('span', attrs={'class':'meta-posted'})

but it gives me:

[<span  style="font-size: large">
 </span>,
 <span  style="font-style: italic">
 </span>,
 <span >
                     Tuesday, August 2 2022 - 10:53AM WIB
                 </span>]

How should I write the code to get only the title and the date?

Thanks in advance

CodePudding user response:

findAll returns a list of matching tags rather than their text, so call get_text() on each tag:

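from bs4 import BeautifulSoup

# html_doc holds the page's HTML as a string
soup = BeautifulSoup(html_doc, 'html.parser')

# find_all is the current name for findAll; it returns a list of tags
titles = soup.find_all('h1', attrs={'class': 'article-title'})
for title in titles:
    # strip=True trims the surrounding whitespace and newlines
    print(title.get_text(strip=True))

dates = soup.find_all('span', attrs={'class': 'meta-posted'})
for date in dates:
    print(date.get_text(strip=True))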
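
Since the date search in the question also matches spans with no text, here is a minimal self-contained sketch that keeps only the non-empty results. The sample markup is hypothetical: it is modeled on the output shown in the question, and the class attributes on the h1 and span tags are assumed from the class names the question mentions.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup modeled on the question; the class
# attributes are assumed from the names given in the question text.
html_doc = """
<article>
  <header>
    <h1 class="article-title">Pertagas, Chandra Asri sign gas MoU</h1>
    <div>
      <span class="meta-posted" style="font-size: large"> </span>
      <span class="meta-posted">
          Tuesday, August 2 2022 - 10:53AM WIB
      </span>
    </div>
  </header>
</article>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# find() returns the first matching tag (or None), not a list.
title_tag = soup.find("h1", class_="article-title")
title = title_tag.get_text(strip=True) if title_tag else None

# Extract the stripped text of every matching span, then drop empty strings.
dates = [s.get_text(strip=True) for s in soup.find_all("span", class_="meta-posted")]
dates = [d for d in dates if d]

print(title)
print(dates)
```

With this sample, title ends up as a plain string and dates as a one-element list holding only the span that actually contains a date. If the page always has exactly one title, find() saves you from looping over a list.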