I am new to BeautifulSoup; I usually do web scraping with Scrapy, which uses response.xpath
to get the text.
This time, I want to get the article title from the element with class article-title
and the published date from the element with class meta-posted.
The HTML looks like this:
<div >
<article >
<header >
<h1 style="font-size: 28px !important; font-family: sans-serif !important;">Presentation: Govt pushes CCS/CCUS development in RI upstream sector</h1>
<div >
<span >
Monday, August 1 2022 - 04:27PM WIB </span>
</div>
To get the title, I tried:
title= res.findAll('h1', attrs={'class':'article-title'})
but it still gives me the whole tag rather than just the text:
[<h1 style="font-size: 28px !important; font-family: sans-serif !important;">Pertagas, Chandra Asri sign gas MoU</h1>]
And to get the date, I tried:
date = res.findAll('span', attrs={'class':'meta-posted'})
but it gives me:
[<span style="font-size: large">
</span>,
<span style="font-style: italic">
</span>,
<span >
Tuesday, August 2 2022 - 10:53AM WIB
</span>]
How should I write the code to get only the title and the date?
Thanks in advance.
CodePudding user response:
This should fix your problem. findAll returns a list of Tag objects, so call get_text() on each one to get only the text:
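from bs4 import BeautifulSoup

# html_doc is the page's HTML source as a string
soup = BeautifulSoup(html_doc, 'html.parser')

# find_all (the current name for findAll) returns a list of Tag objects
titles = soup.find_all('h1', attrs={'class': 'article-title'})
for title in titles:
    print(title.get_text(strip=True))  # strip=True trims surrounding whitespace

dates = soup.find_all('span', attrs={'class': 'meta-posted'})
for date in dates:
    print(date.get_text(strip=True))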
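
If each page has only one title and one date, a variant using find() may be simpler, since it returns a single Tag (or None) instead of a list. The sketch below is only an illustration: the sample markup is adapted from the question, and the class attributes on the h1 and span are an assumption based on the class names the question mentions.

```python
from bs4 import BeautifulSoup

# Sample markup adapted from the question. The class attributes are
# assumed from the class names described there (article-title, meta-posted).
html_doc = """
<article>
  <header>
    <h1 class="article-title">Presentation: Govt pushes CCS/CCUS development in RI upstream sector</h1>
    <div>
      <span class="meta-posted">
        Monday, August 1 2022 - 04:27PM WIB </span>
    </div>
  </header>
</article>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# find() returns the first matching Tag, or None when nothing matches.
title_tag = soup.find("h1", class_="article-title")
date_tag = soup.find("span", class_="meta-posted")

# Guard against None so a missing element does not raise AttributeError.
title = title_tag.get_text(strip=True) if title_tag else None
date = date_tag.get_text(strip=True) if date_tag else None

print(title)
print(date)
```

Filtering on the class also keeps the unrelated span elements (the ones with only a style attribute) out of the result, so no extra empty lines are printed.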