I am using bs4 to write a web scraper that collects funding news data.
- The first part of my code extracts the title, link, summary and date of each article for n pages.
- The second part loops through the link column and passes each resulting URL to a new function, which extracts the URL of the company in question.
For the most part the code works fine (40 pages scraped without errors). I am trying to stress test it by raising the page count to 80, but I'm running into KeyError: 'href' and I don't know how to fix it.
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from tqdm import tqdm
def clean_data(column):
    # strips non-ASCII characters from a column of the global DataFrame df
    df[column] = df[column].str.encode('ascii', 'ignore').str.decode('ascii')
#extract
def extract(page):
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Safari/605.1.15'}
    url = f'https://www.uktechnews.info/category/investment-round/series-a/page/{page}/'
    r = requests.get(url, headers=headers)  # pass headers as a keyword argument, not positionally (the second positional argument is params)
    soup = BeautifulSoup(r.content, 'html.parser')
    return soup
#transform
def transform(soup):
    for item in soup.find_all('div', class_='post-block-style'):
        title = item.find('h3', {'class': 'post-title'}).text.replace('\n', '')
        link = item.find('a')['href']
        summary = item.find('p').text
        date = item.find('span', {'class': 'post-meta-date'}).text.replace('\n', '')
        news = {
            'title': title,
            'link': link,
            'summary': summary,
            'date': date
        }
        newslist.append(news)
    return
newslist = []
#subpage
def extract_subpage(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Safari/605.1.15'}
    r = requests.get(url, headers=headers)  # keyword argument here too
    soup_subpage = BeautifulSoup(r.text, 'html.parser')
    return soup_subpage
def transform_subpage(soup_subpage):
    main_data = soup_subpage.select("div.entry-content.clearfix > p > a")
    if len(main_data):
        subpage_link = {
            'subpage_link': main_data[0]['href']
        }
        subpage.append(subpage_link)
    else:
        subpage_link = {
            'subpage_link': '--'
        }
        subpage.append(subpage_link)
    return
subpage = []
#load
page = np.arange(0, 80, 1).tolist()
for page in tqdm(page):
    try:
        c = extract(page)
        transform(c)
    except:
        None

df1 = pd.DataFrame(newslist)

for url in tqdm(df1['link']):
    t = extract_subpage(url)
    transform_subpage(t)

df2 = pd.DataFrame(subpage)
The traceback ends with: KeyError: 'href'
I think the issue is that the if statement in my transform_subpage function does not account for cases where main_data is a non-empty list but its first tag has no href attribute. I am relatively new to Python, so any help would be much appreciated!
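For example, this minimal snippet shows what I suspect is happening (the HTML here is my own made-up example): a tag that matches the selector but has no href attribute raises exactly this error.

from bs4 import BeautifulSoup

# a named anchor has no href but still matches the selector
html = '<div class="entry-content clearfix"><p><a name="top">no link</a></p></div>'
demo = BeautifulSoup(html, 'html.parser')
main_data = demo.select("div.entry-content.clearfix > p > a")
print(len(main_data))        # 1, so the if branch runs
print(main_data[0]['href'])  # raises KeyError: 'href'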
CodePudding user response:
You are correct, it's caused by main_data[0] not having an 'href' attribute at some point. You can try changing the logic to something like:
def transform_subpage(soup_subpage):
    main_data = soup_subpage.select("div.entry-content.clearfix > p > a")
    # only index 'href' when something matched and the first tag has one
    if len(main_data) and 'href' in main_data[0].attrs:
        subpage_link = {
            'subpage_link': main_data[0]['href']
        }
    else:
        # fall back to '--' so one row is still appended for every article
        subpage_link = {
            'subpage_link': '--'
        }
    subpage.append(subpage_link)
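Alternatively, a Tag's .get() method behaves like dict.get() and returns a default instead of raising, so the same logic can be written more compactly (a sketch using the same selector and the same '--' fallback):

def transform_subpage(soup_subpage):
    main_data = soup_subpage.select("div.entry-content.clearfix > p > a")
    # Tag.get() returns '--' when the href attribute is missing
    href = main_data[0].get('href', '--') if main_data else '--'
    subpage.append({'subpage_link': href})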
Also just a note: it's not a great idea to reuse the list's variable name for the loop variable, because after the loop, page refers to the last item rather than the list. So change it to something like:
page_list = np.arange(0, 80, 1).tolist()
for page in tqdm(page_list):
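As an aside, numpy isn't needed just to build the sequence of page numbers; the built-in range produces the same values:

for page in tqdm(range(80)):  # 0 to 79, same as np.arange(0, 80, 1).tolist()
    try:
        c = extract(page)
        transform(c)
    except:
        None  # loop body unchanged from your original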