I'm new to Beautiful Soup and web scraping in general. I'm trying to build a DataFrame that holds the title, content, and publish date of each post on a blog-style website (everything is on one page: a title, a publish date, and then the post's content). I can get the title and publish date easily enough, but I can't correctly pull the post's content. Each post is structured like so:
<h2 class = "thisYear" title = "Click here to display/hide information">
"First Post Title" </h2>
<p class ="pubdate" style="display: block;"> 2022-07-11</p>
<p style="display: block;"> "First paragraph of post"</p>
<p style="display: block;"> "Second paragraph of post"</p>
<h2 class = "thisYear" title = "Click here to display/hide information">
"Second Post Title" </h2>
<p class ="pubdate" style="display: block;"> 2022-07-07</p>
<p style="display: block;"> "First paragraph of post"</p>
<p style="display: block;"> "Second paragraph of post"</p>
Current Code:
import requests
import pandas as pd
from bs4 import BeautifulSoup

r = requests.get(URL, allow_redirects=True)
soup = BeautifulSoup(r.content, 'html5lib')
tag = 'p'
title_class_name = "thisYear"
news_class_name = "thisYear"
date_class_name = "pubdate"
df = pd.DataFrame()
title_list = []
news_list =[]
date_list = []
title_table = soup.findAll('h2',attrs= {'class':title_class_name})
news_table = soup.findAll(tag,attrs= {'class': None})
date_table = soup.findAll(tag,attrs= {'class':date_class_name})
for (title, news, date) in zip(title_table, news_table, date_table):
    title_list.append(title.text)
    news_list.append(news.text)
    date_list.append(date.text)
df['title'] = title_list
df['news']=news_list
df['publish_date']=date_list
df
I think I see the problem: each paragraph is pulled as a separate news entry, but I haven't been able to correct that yet. How would I pull only the content that sits between one h2 tag with class='thisYear' and the next?
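To make the suspected problem concrete, here is a small sketch with plain lists standing in for the three findAll results. The values are placeholders rather than scraped data, and the names titles, dates, and paragraphs are made up for the illustration:

```python
# Plain lists standing in for the three findAll() results (placeholder values).
titles = ["First Post Title", "Second Post Title"]
dates = ["2022-07-11", "2022-07-07"]
paragraphs = [
    "Post 1, paragraph 1",
    "Post 1, paragraph 2",
    "Post 2, paragraph 1",
    "Post 2, paragraph 2",
]

# zip() pairs items by position and stops at the shortest input,
# so the second title is matched with the second paragraph of post 1
# and the paragraphs of post 2 are dropped entirely.
rows = list(zip(titles, paragraphs, dates))
for row in rows:
    print(row)
```

So the paragraphs need to be grouped per post before they are lined up with the titles and dates.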
CodePudding user response:
You can use, for example, tag.find_previous to find which block each paragraph belongs to:
from bs4 import BeautifulSoup
html_doc = """\
<h2 class = "thisYear" title = "Click here to display/hide information">
"First Post Title" </h2>
<p class ="pubdate" style="display: block;"> 2022-07-11</p>
<p style="display: block;"> "First paragraph of post 1"</p>
<p style="display: block;"> "Second paragraph of post 1"</p>
<h2 class = "thisYear" title = "Click here to display/hide information">
"Second Post Title" </h2>
<p class ="pubdate" style="display: block;"> 2022-07-07</p>
<p style="display: block;"> "First paragraph of post 2"</p>
<p style="display: block;"> "Second paragraph of post 2"</p>"""
soup = BeautifulSoup(html_doc, "html.parser")
out = {}
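# Select every <p> that follows an h2.thisYear as a sibling, skipping the date lines
for p in soup.select("h2.thisYear ~ p:not(.pubdate)"):
    # The closest preceding <h2> and .pubdate identify the post this paragraph belongs to
    title = p.find_previous("h2").text.strip()
    pubdate = p.find_previous(class_="pubdate").text.strip()
    out.setdefault((title, pubdate), []).append(p.text.strip())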
print(out)
Prints:
{
    ('"First Post Title"', "2022-07-11"): [
        '"First paragraph of post 1"',
        '"Second paragraph of post 1"',
    ],
    ('"Second Post Title"', "2022-07-07"): [
        '"First paragraph of post 2"',
        '"Second paragraph of post 2"',
    ],
}
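The grouping step relies on dict.setdefault, which returns the existing list for a key or inserts an empty one first. A minimal sketch of just that step, using made-up (title, pubdate, paragraph) triples in place of the parsed tags (records and grouped are illustrative names):

```python
# Hypothetical (title, pubdate, paragraph) triples, in document order.
records = [
    ("First Post Title", "2022-07-11", "First paragraph of post 1"),
    ("First Post Title", "2022-07-11", "Second paragraph of post 1"),
    ("Second Post Title", "2022-07-07", "First paragraph of post 2"),
    ("Second Post Title", "2022-07-07", "Second paragraph of post 2"),
]

grouped = {}
for title, pubdate, text in records:
    # setdefault() creates the list on first sight of a key, then reuses it.
    grouped.setdefault((title, pubdate), []).append(text)

print(grouped)
```

One caveat of keying on (title, pubdate): two different posts that share both the title and the date would be merged into a single entry.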
EDIT: To transform out into a DataFrame, you can do:
import pandas as pd
df = pd.DataFrame(
    [
        # One row per post; paragraphs are joined with newlines
        (title, date, "\n".join(paragraphs))
        for (title, date), paragraphs in out.items()
    ],
    columns=["Title", "Date", "Paragraphs"],
)
print(df)
Prints:
                 Title        Date                                                 Paragraphs
0   "First Post Title"  2022-07-11  "First paragraph of post 1"\n"Second paragraph of post 1"
1  "Second Post Title"  2022-07-07  "First paragraph of post 2"\n"Second paragraph of post 2"
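An alternative that might map more directly onto "everything between one h2 and the next" is to start from each heading and walk its following siblings until the next h2. This is only a sketch under some assumptions: the HTML below is a simplified version of the sample (no quotes in the text, no style attributes), the headings and paragraphs are siblings in the parsed tree, and posts is an illustrative name:

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the page; the real markup may differ.
html_doc = """\
<h2 class="thisYear">First Post Title</h2>
<p class="pubdate">2022-07-11</p>
<p>First paragraph of post 1</p>
<p>Second paragraph of post 1</p>
<h2 class="thisYear">Second Post Title</h2>
<p class="pubdate">2022-07-07</p>
<p>First paragraph of post 2</p>
<p>Second paragraph of post 2</p>"""

soup = BeautifulSoup(html_doc, "html.parser")

posts = []
for h2 in soup.select("h2.thisYear"):
    pubdate = None
    paragraphs = []
    # Walk forward through sibling tags until the next heading starts a new post.
    for sib in h2.find_next_siblings():
        if sib.name == "h2":
            break
        if sib.name != "p":
            continue
        # A multi-valued class attribute comes back as a list (or None if absent).
        if "pubdate" in (sib.get("class") or []):
            pubdate = sib.get_text(strip=True)
        else:
            paragraphs.append(sib.get_text(strip=True))
    posts.append(
        {
            "Title": h2.get_text(strip=True),
            "Date": pubdate,
            "Paragraphs": "\n".join(paragraphs),
        }
    )

print(posts)
```

Because each post becomes its own dict, posts with identical titles and dates stay separate, and pd.DataFrame(posts) should give the same three columns as above.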