KeyError: 'href' when using BeautifulSoup?

Time:07-12

I am trying to build a dataframe of article titles and links using BeautifulSoup from a website structured like this:

<div class = "publish date"> Article 1 published date </div>
<div class = "headline">
 <a href="article 1 link " target ="_blank" title="Opens in a new window"> Article 1 Title
 </a>
</div>

<div class = "publish date"> Article 2 published date</div>
<div class = "headline">
 <a href="article 2 link" target ="_blank" title="Opens in a new window"> Article 2 Title
 </a>
</div>

I was able to do this from a different webpage previously, but I am running into an error with this webpage.

My code:

r = requests.get(URL,allow_redirects=True)
soup = BeautifulSoup(r.content, 'html5lib')
    
tag = 'div'
title_class_name = "headline"


df = pd.DataFrame()
title_list = []
link_list=[]

title_table = soup.findAll(tag,attrs= {'class':title_class_name})
link_table = soup.findAll(tag,attrs= {'class':title_class_name})

for (title, link) in zip(title_table, link_table):
    title_list.append(title.text)
    link_list.append(link['href']) 
    df['title'] = title_list
    df['source'] = link_list

The only difference between the two pages is that the other website had a specific class on the a tags that hold the href links. Here, soup.findAll(tag, attrs={'class': title_class_name}) does return markup that contains the a tag and its href, so I'm not sure why the line link_list.append(link['href']) throws an error.

CodePudding user response:

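The KeyError is raised because each item returned by findAll is the <div class="headline"> element, which has no href attribute; the attribute lives on the <a> nested inside it. Change your selection strategy and read both the title and the link from each headline element in one pass: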

data = []
for e in soup.select('.headline'):
    data.append({
        'title':e.text.strip(),
        'url':e.a.get('href')
    })
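To see the difference between subscripting the div and reading from its nested a tag, here is a minimal sketch. It assumes a one-element HTML string modeled on the question's markup and uses the built-in html.parser; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical one-headline snippet modeled on the question's markup.
html = '<div class="headline"><a href="article 1 link">Article 1 Title</a></div>'
soup = BeautifulSoup(html, "html.parser")

div = soup.find("div", attrs={"class": "headline"})

# The <div> itself only carries a class attribute.
div_attrs = div.attrs

# Subscripting a missing attribute raises KeyError, as in the question.
try:
    div["href"]
    raised = False
except KeyError:
    raised = True

# .get() returns None for a missing attribute instead of raising.
div_href = div.get("href")

# The href belongs to the nested <a> tag.
a_href = div.a["href"]
```

Printing div_attrs shows that href is not among the div's attributes, which is why link['href'] fails even though the div's markup contains the link.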

Note: In new code, prefer find_all() over the legacy findAll() spelling. For more, take a minute to check the BeautifulSoup docs.
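If you would rather keep the find_all() style than switch to CSS selectors, the same loop could look roughly like the sketch below. The second headline without a link is a made-up case, added only to show a guard for divs that contain no a tag (where e.a would be None and e.a.get would fail).

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second headline has no <a> on purpose.
html = """
<div class="headline"><a href="article 1 link">Article 1 Title</a></div>
<div class="headline">No link in this one</div>
"""
soup = BeautifulSoup(html, "html.parser")

data = []
for div in soup.find_all("div", class_="headline"):
    anchor = div.find("a")  # None when the div has no link
    data.append({
        "title": div.get_text(strip=True),
        "url": anchor.get("href") if anchor is not None else None,
    })
```

The resulting list of dicts can be passed to pd.DataFrame(data) in the same way as in the example.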

Example

from bs4 import BeautifulSoup
import pandas as pd
html='''
<div class = "publish date"> Article 1 published date </div>
<div class = "headline">
 <a href="article 1 link " target ="_blank" title="Opens in a new window"> Article 1 Title
 </a>
</div>

<div class = "publish date"> Article 2 published date</div>
<div class = "headline">
 <a href="article 2 link" target ="_blank" title="Opens in a new window"> Article 2 Title
 </a>
</div>
'''

# name a parser explicitly to avoid bs4's GuessedAtParserWarning
soup = BeautifulSoup(html, 'html.parser')

data = []
for e in soup.select('.headline'):
    data.append({
        'title':e.text.strip(),
        'url':e.a.get('href')
    })
pd.DataFrame(data)

Output

             title             url
0  Article 1 Title  article 1 link
1  Article 2 Title  article 2 link