I am trying to build a dataframe of article titles and links using BeautifulSoup from a website structured like this:
<div class = "publish date"> Article 1 published date </div>
<div class = "headline">
<a href="article 1 link " target ="_blank" title="Opens in a new window"> Article 1 Title
</a>
</div>
<div class = "publish date"> Article 2 published date</div>
<div class = "headline">
<a href="article 2 link" target ="_blank" title="Opens in a new window"> Article 2 Title
</a>
</div>
I was able to do this from a different webpage previously, but I am running into an error with this webpage.
My code:
import pandas as pd
import requests
from bs4 import BeautifulSoup

r = requests.get(URL, allow_redirects=True)
soup = BeautifulSoup(r.content, 'html5lib')
tag = 'div'
title_class_name = "headline"
df = pd.DataFrame()
title_list = []
link_list = []
title_table = soup.findAll(tag, attrs={'class': title_class_name})
link_table = soup.findAll(tag, attrs={'class': title_class_name})
for (title, link) in zip(title_table, link_table):
    title_list.append(title.text)
    link_list.append(link['href'])  # this line throws the error
df['title'] = title_list
df['source'] = link_list
The only difference between the two is that the other website had a specific class on the a tags for the href links. Using soup.findAll(tag, attrs={'class': title_class_name}) does pull the a tag and the href link, so I'm not sure why the line link_list.append(link['href']) is throwing an error.
CodePudding user response:
Try to change your selection strategy and process all information in one go:
data = []
for e in soup.select('.headline'):
    data.append({
        'title': e.text.strip(),
        'url': e.a.get('href')
    })
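As for why the original line fails: the loop in the question iterates over the div elements, and the href attribute belongs to the a tag nested inside each div, not to the div itself. Here is a minimal sketch of that difference, assuming markup like the sample in the question (the one-line HTML string and the variable names are illustrative only):

```python
from bs4 import BeautifulSoup

# Illustrative markup modeled on the question's structure.
html = '<div class="headline"><a href="article 1 link">Article 1 Title</a></div>'
soup = BeautifulSoup(html, "html.parser")

div = soup.find("div", attrs={"class": "headline"})

# The div has a class attribute but no href, so subscripting it raises KeyError.
try:
    div["href"]
    error_name = None
except KeyError as exc:
    error_name = type(exc).__name__

# The href is found on the nested <a> tag instead.
href = div.a["href"]

print(error_name)
print(href)
```

Printing the div shows the a tag and its href because they are part of the div's contents, which is probably why it looked as though the link had been selected.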
Note: In new code, avoid the old findAll() syntax and use find_all() instead. For more, take a minute to check the docs.
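If you prefer to stay with the find-style API rather than CSS selectors, the same selection could be written with find_all(). This is only a sketch under the assumption that every headline div contains an a tag; the sample HTML and the names headlines, titles and links are made up for illustration:

```python
from bs4 import BeautifulSoup

# Illustrative markup modeled on the question's structure.
html = """
<div class="headline"><a href="article 1 link">Article 1 Title</a></div>
<div class="headline"><a href="article 2 link">Article 2 Title</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all() with the class_ keyword replaces findAll(tag, attrs={'class': ...}).
headlines = soup.find_all("div", class_="headline")

# Read the title from the div's text and the link from the nested <a> tag.
titles = [div.get_text(strip=True) for div in headlines]
links = [div.a.get("href") for div in headlines]

print(titles)
print(links)
```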
Example
import pandas as pd
from bs4 import BeautifulSoup

html = '''
<div class = "publish date"> Article 1 published date </div>
<div class = "headline">
<a href="article 1 link " target ="_blank" title="Opens in a new window"> Article 1 Title
</a>
</div>
<div class = "publish date"> Article 2 published date</div>
<div class = "headline">
<a href="article 2 link" target ="_blank" title="Opens in a new window"> Article 2 Title
</a>
</div>
'''
# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')
data = []
# Each .headline div holds both the title text and the nested <a> link
for e in soup.select('.headline'):
    data.append({
        'title': e.text.strip(),
        'url': e.a.get('href')
    })
pd.DataFrame(data)
Output
| | title | url |
|---|---|---|
| 0 | Article 1 Title | article 1 link |
| 1 | Article 2 Title | article 2 link |
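One caveat, in case the real page is less regular than the sample: if a headline div has no a tag, e.a is None and e.a.get('href') raises an AttributeError. A hedged sketch of a guard for that case follows; the second div and the example.com URL are invented for illustration and do not come from the question:

```python
from bs4 import BeautifulSoup

# Illustrative markup: the second headline deliberately has no link.
html = """
<div class="headline"><a href="https://example.com/1">Article 1 Title</a></div>
<div class="headline">Headline without a link</div>
"""
soup = BeautifulSoup(html, "html.parser")

data = []
for e in soup.select(".headline"):
    # Only match <a> tags that actually carry an href attribute.
    a = e.find("a", href=True)
    data.append({
        "title": e.get_text(strip=True),
        "url": a["href"] if a else None,  # None when no link is present
    })

print(data)
```

Passing data to pd.DataFrame() as above then gives a missing value in the url column for such rows instead of failing.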