Webscraping project: How to handle "AttributeError: 'NoneType' object has no attribute 'find'"

First time posting. I am learning Python to do a web scraping project for my work. I am trying to collect information on the different projects this organisation shares on their website (my company has asked them for permission, so that is all good). I managed to run the code with no issues when scraping their HPV projects (52 in total), but when trying to scrape their HIV projects (131 in total) I am running into the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_9556/1476973876.py in <module>
     11     project_description = soup.find('div', class_="column-1").text
     12     project_details = soup.find(class_="block-details")
---> 13     project_number = project_details.find("strong").text
     14     project_start = project_details.find("span", class_="bar-start").text
     15     project_end = project_details.find("span", class_="bar-end").text

AttributeError: 'NoneType' object has no attribute 'find'

When scraping a list of just the first 10 URLs, it works fine. I believe that the problem might be that one of the links doesn't have the "strong" text. If so, how can I identify which link is not working?

Here is my code (sorry if it is messy, would appreciate tips on how to improve)

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time


st = time.time()
URLs = ['https://www.zonmw.nl/nl/onderzoek-resultaten/preventie/gezonde-wijk-en-omgeving/programmas/project-detail/preventieprogramma-4/hiv-self-testing-combined-with-internet-counselling-a-low-threshold-strategy-to-increase-diagnoses/', 'https://www.zonmw.nl/nl/onderzoek-resultaten/fundamenteel-onderzoek/programmas/project-detail/vici/training-b-cells-to-generate-broadly-neutralizing-hiv-antibodies/', 'https://www.zonmw.nl/nl/over-zonmw/e-health-en-ict-in-de-zorg/programmas/project-detail/diseasemanagement-chronische-ziekten/comorbidity-and-aging-with-hiv/', 'https://www.zonmw.nl/nl/onderzoek-resultaten/geneesmiddelen/programmas/project-detail/kennisbeleid-kwaliteit-curatieve-zorg/een-multidisiplinaire-richtlijn-voor-arbeidsgerelateerde-problematiek-bij-mensen-met-hiv/', 'https://www.zonmw.nl/nl/onderzoek-resultaten/doelmatigheidsonderzoek/programmas/project-detail/goed-gebruik-geneesmiddelen/study-to-optimize-antiretroviral-regimens-in-hiv-infected-women-who-want-to-breastfeed-panna-b/', 'https://www.zonmw.nl/nl/over-zonmw/e-health-en-ict-in-de-zorg/programmas/project-detail/active-and-assisted-living-aal2/u-topia-towards-empowering-older-persons-living-with-hiv/', ...]
data = []


for URL in URLs:
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, 'html.parser')
    
    project_title = soup.find('h1').text
    project_description = soup.find('div', class_="column-1").text
    project_details = soup.find(class_="block-details")
    project_number = project_details.find("strong").text
    project_start = project_details.find("span", class_="bar-start").text
    project_end = project_details.find("span", class_="bar-end").text
    project_program = project_details.find("ul").text
    
    
    for node in project_details.find_all("p"):
        keywords = node.text.split(', ')
    project_recipient = keywords[-1]
    
    data.append((project_title, project_description, project_number, project_start, project_end, project_program, project_recipient))

et = time.time()

elapsed_time = et - st
print('Execution time:', elapsed_time, 'seconds')

Thank you so much!

CodePudding user response：

When a page is missing one of the elements, find() returns None, and calling .text or .find() on that None raises the AttributeError. You can handle it with an if/else None check.
Most of the element selections for the columns were also incorrect.

Working Code:

import requests
from bs4 import BeautifulSoup
import pandas as pd


URLs = ['https://www.zonmw.nl/nl/onderzoek-resultaten/preventie/gezonde-wijk-en-omgeving/programmas/project-detail/preventieprogramma-4/hiv-self-testing-combined-with-internet-counselling-a-low-threshold-strategy-to-increase-diagnoses/', 'https://www.zonmw.nl/nl/onderzoek-resultaten/fundamenteel-onderzoek/programmas/project-detail/vici/training-b-cells-to-generate-broadly-neutralizing-hiv-antibodies/', 'https://www.zonmw.nl/nl/over-zonmw/e-health-en-ict-in-de-zorg/programmas/project-detail/diseasemanagement-chronische-ziekten/comorbidity-and-aging-with-hiv/', 'https://www.zonmw.nl/nl/onderzoek-resultaten/geneesmiddelen/programmas/project-detail/kennisbeleid-kwaliteit-curatieve-zorg/een-multidisiplinaire-richtlijn-voor-arbeidsgerelateerde-problematiek-bij-mensen-met-hiv/', 'https://www.zonmw.nl/nl/onderzoek-resultaten/doelmatigheidsonderzoek/programmas/project-detail/goed-gebruik-geneesmiddelen/study-to-optimize-antiretroviral-regimens-in-hiv-infected-women-who-want-to-breastfeed-panna-b/', 'https://www.zonmw.nl/nl/over-zonmw/e-health-en-ict-in-de-zorg/programmas/project-detail/active-and-assisted-living-aal2/u-topia-towards-empowering-older-persons-living-with-hiv/']
data = []
for URL in URLs:
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, 'html.parser')
    
    project_title = soup.find('h1').text
    #print(project_title)
    project_description = soup.find('div', class_="column-1").text
    #print(project_description)
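    # Guard against pages that have no "block-details" element at all.
    project_details = soup.find(class_="block-details")
    project_details = project_details.get_text(strip=True) if project_details else None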
    print(project_details)
    project_number = soup.select_one(".block-details > p + h4 + p > strong")

    project_number = project_number.get_text() if project_number else None
    print(project_number)
    project_start = soup.find("span", class_="bar-start").get_text(strip=True)
    print(project_start)
    project_end = soup.find("span", class_="bar-end").get_text(strip=True)
    print(project_end)
    project_program = 'https://www.zonmw.nl' + soup.select('.arrow-list > li > a')[0].get('href')
    print(project_program)
    data.append({
        'project_title':project_title, 
        'project_description':project_description,
        'project_number':project_number, 
        'project_start':project_start, 
        'project_end':project_end, 
        'project_program_url':project_program
        })

df = pd.DataFrame(data)
print(df)

Output:

       project_title  ...                                project_program_url
0  HIV self testing combined with internet counse...  ...  https://www.zonmw.nl/nl/onderzoek-resultaten/p...
1  Training B cells to generate broadly neutraliz...  ...  https://www.zonmw.nl/nl/onderzoek-resultaten/f...
2                     Comorbidity and Aging with HIV  ...  https://www.zonmw.nl/nl/over-zonmw/e-health-en...
3  Een multidisiplinaire richtlijn voor arbeidsge...  ...  https://www.zonmw.nl/nl/onderzoek-resultaten/g...
4  Study to oPtimize ANtiretroviral regimeNs in H...  ...  https://www.zonmw.nl/nl/onderzoek-resultaten/d...
5  U-TOPIA, Towards empowering older persons livi...  ...  https://www.zonmw.nl/nl/over-zonmw/e-health-en...

[6 rows x 6 columns]
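
The if/else None check used above can also be factored into a small helper, so every field is read the same way. This is only a minimal sketch: the helper name safe_text is hypothetical, and the HTML snippets are made up for illustration and are not the real zonmw.nl markup.

```python
from bs4 import BeautifulSoup


def safe_text(parent, *args, **kwargs):
    """Return the stripped text of the first match, or None if nothing matches.

    `parent` may itself be None, e.g. when an earlier find() came back empty.
    """
    if parent is None:
        return None
    node = parent.find(*args, **kwargs)
    return node.get_text(strip=True) if node is not None else None


# Made-up markup for illustration only (assumption, not the real site).
complete = BeautifulSoup(
    '<div class="block-details"><strong>12345</strong>'
    '<span class="bar-start">2010</span></div>',
    "html.parser",
)
incomplete = BeautifulSoup("<h1>No details here</h1>", "html.parser")

for soup in (complete, incomplete):
    details = soup.find(class_="block-details")  # None for the second page
    number = safe_text(details, "strong")
    start = safe_text(details, "span", class_="bar-start")
    print(number, start)
```

With a helper like this, a page without the details block yields None values in the row instead of stopping the whole loop.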

CodePudding user response：

First check whether the element (xpath/class) is available, and only then use .text. Also initialize empty variables inside the for loop. See the sample below for validation.

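# Assumes requests, BeautifulSoup, URLs and data are set up as in the question.
for URL in URLs:
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, 'html.parser')
    
    # Initialize the variables. They keep the default blank value if the element is unavailable.
    project_number = project_start = ''
    
    # Look the elements up first. [Do not use .text directly on the result of find()]
    project_details = soup.find(class_="block-details")
    project_number_xp = project_details.find("strong") if project_details else None
    project_start_xp = project_details.find("span", class_="bar-start") if project_details else None
    
    # Apply .text only if the element was found; otherwise it throws the AttributeError.
    if project_number_xp:
        project_number = project_number_xp.text
    if project_start_xp:
        project_start = project_start_xp.text
        
    # Append the data
    data.append((URL, project_number, project_start))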
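
To answer the original question of which link is failing, one option is to wrap the per-page parsing in try/except and record the URL whenever the AttributeError occurs. The following is a rough sketch under stated assumptions: scrape_all and fake_parse are hypothetical names, the example.org URLs are placeholders, and fake_parse only stands in for the real requests/BeautifulSoup code so the example runs without network access.

```python
def scrape_all(urls, parse_one):
    """Call parse_one(url) for each URL; return (rows, failed).

    `failed` holds (url, error message) pairs for pages where an
    expected element was missing and an AttributeError was raised.
    """
    rows, failed = [], []
    for url in urls:
        try:
            rows.append(parse_one(url))
        except AttributeError as exc:
            # Calling .find() or .text on None ends up here.
            failed.append((url, str(exc)))
    return rows, failed


def fake_parse(url):
    """Stand-in parser (assumption): fails the way soup.find() returning None does."""
    details = None if "broken" in url else {"number": "12345"}
    return (url, details.get("number"))  # raises AttributeError when details is None


rows, failed = scrape_all(
    ["https://example.org/ok/", "https://example.org/broken/"],
    fake_parse,
)
for url, error in failed:
    print(url, "->", error)
```

In the real loop, parse_one would be a function containing the body of the for loop from the question; the failed list then names every URL whose page lacks an expected element.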