First time posting. I am learning Python for a web scraping project at work. I am trying to collect information on the different projects this organisation shares on its website (my company has asked them for permission, so that is all good). The code ran with no issues when scraping their HPV projects (52 in total), but when I try to scrape their HIV projects (131 in total) I run into the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_9556/1476973876.py in <module>
11 project_description = soup.find('div', class_="column-1").text
12 project_details = soup.find(class_="block-details")
---> 13 project_number = project_details.find("strong").text
14 project_start = project_details.find("span", class_="bar-start").text
15 project_end = project_details.find("span", class_="bar-end").text
AttributeError: 'NoneType' object has no attribute 'find'
When scraping a list of just the first 10 URLs, it works fine. I believe that the problem might be that one of the links doesn't have the "strong" text. If so, how can I identify which link is not working?
Here is my code (sorry if it is messy, would appreciate tips on how to improve)
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
st = time.time()
URLs = ['https://www.zonmw.nl/nl/onderzoek-resultaten/preventie/gezonde-wijk-en-omgeving/programmas/project-detail/preventieprogramma-4/hiv-self-testing-combined-with-internet-counselling-a-low-threshold-strategy-to-increase-diagnoses/', 'https://www.zonmw.nl/nl/onderzoek-resultaten/fundamenteel-onderzoek/programmas/project-detail/vici/training-b-cells-to-generate-broadly-neutralizing-hiv-antibodies/', 'https://www.zonmw.nl/nl/over-zonmw/e-health-en-ict-in-de-zorg/programmas/project-detail/diseasemanagement-chronische-ziekten/comorbidity-and-aging-with-hiv/', 'https://www.zonmw.nl/nl/onderzoek-resultaten/geneesmiddelen/programmas/project-detail/kennisbeleid-kwaliteit-curatieve-zorg/een-multidisiplinaire-richtlijn-voor-arbeidsgerelateerde-problematiek-bij-mensen-met-hiv/', 'https://www.zonmw.nl/nl/onderzoek-resultaten/doelmatigheidsonderzoek/programmas/project-detail/goed-gebruik-geneesmiddelen/study-to-optimize-antiretroviral-regimens-in-hiv-infected-women-who-want-to-breastfeed-panna-b/', 'https://www.zonmw.nl/nl/over-zonmw/e-health-en-ict-in-de-zorg/programmas/project-detail/active-and-assisted-living-aal2/u-topia-towards-empowering-older-persons-living-with-hiv/', ...]
data = []
for URL in URLs:
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, 'html.parser')
    project_title = soup.find('h1').text
    project_description = soup.find('div', class_="column-1").text
    project_details = soup.find(class_="block-details")
    project_number = project_details.find("strong").text
    project_start = project_details.find("span", class_="bar-start").text
    project_end = project_details.find("span", class_="bar-end").text
    project_program = project_details.find("ul").text
    for node in project_details.find_all("p"):
        keywords = node.text.split(', ')
        project_recipient = keywords[-1]
    data.append((project_title, project_description, project_number, project_start, project_end, project_program, project_recipient))
et = time.time()
elapsed_time = et - st
print('Execution time:', elapsed_time, 'seconds')
Thank you so much!
CodePudding user response:
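When a page is missing one of the elements you look up, find() returns None, and calling .text or .find() on that None raises the AttributeError you are seeing ('NoneType' object has no attribute ...). You can handle it with an if/else expression that falls back to None.
Most of the element selectors for the columns were also incorrect.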
Working Code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
URLs = ['https://www.zonmw.nl/nl/onderzoek-resultaten/preventie/gezonde-wijk-en-omgeving/programmas/project-detail/preventieprogramma-4/hiv-self-testing-combined-with-internet-counselling-a-low-threshold-strategy-to-increase-diagnoses/', 'https://www.zonmw.nl/nl/onderzoek-resultaten/fundamenteel-onderzoek/programmas/project-detail/vici/training-b-cells-to-generate-broadly-neutralizing-hiv-antibodies/', 'https://www.zonmw.nl/nl/over-zonmw/e-health-en-ict-in-de-zorg/programmas/project-detail/diseasemanagement-chronische-ziekten/comorbidity-and-aging-with-hiv/', 'https://www.zonmw.nl/nl/onderzoek-resultaten/geneesmiddelen/programmas/project-detail/kennisbeleid-kwaliteit-curatieve-zorg/een-multidisiplinaire-richtlijn-voor-arbeidsgerelateerde-problematiek-bij-mensen-met-hiv/', 'https://www.zonmw.nl/nl/onderzoek-resultaten/doelmatigheidsonderzoek/programmas/project-detail/goed-gebruik-geneesmiddelen/study-to-optimize-antiretroviral-regimens-in-hiv-infected-women-who-want-to-breastfeed-panna-b/', 'https://www.zonmw.nl/nl/over-zonmw/e-health-en-ict-in-de-zorg/programmas/project-detail/active-and-assisted-living-aal2/u-topia-towards-empowering-older-persons-living-with-hiv/']
data = []
for URL in URLs:
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, 'html.parser')
    project_title = soup.find('h1').text
    #print(project_title)
    project_description = soup.find('div', class_="column-1").text
    #print(project_description)
    project_details = soup.find(class_="block-details").get_text(strip=True)
    print(project_details)
    # select_one returns None when nothing matches, so check before calling get_text
    project_number = soup.select_one(".block-details > p h4 p > strong")
    project_number = project_number.get_text() if project_number else None
    print(project_number)
    project_start = soup.find("span", class_="bar-start").get_text(strip=True)
    print(project_start)
    project_end = soup.find("span", class_="bar-end").get_text(strip=True)
    print(project_end)
    # build an absolute URL from the first link in the programme list
    project_program = 'https://www.zonmw.nl' + soup.select('.arrow-list > li > a')[0].get('href')
    print(project_program)
    data.append({
        'project_title':project_title,
        'project_description':project_description,
        'project_number':project_number,
        'project_start':project_start,
        'project_end':project_end,
        'project_program_url':project_program
    })
df = pd.DataFrame(data)
print(df)
Output:
project_title ... project_program_url
0 HIV self testing combined with internet counse... ... https://www.zonmw.nl/nl/onderzoek-resultaten/p...
1 Training B cells to generate broadly neutraliz... ... https://www.zonmw.nl/nl/onderzoek-resultaten/f...
2 Comorbidity and Aging with HIV ... https://www.zonmw.nl/nl/over-zonmw/e-health-en...
3 Een multidisiplinaire richtlijn voor arbeidsge... ... https://www.zonmw.nl/nl/onderzoek-resultaten/g...
4 Study to oPtimize ANtiretroviral regimeNs in H... ... https://www.zonmw.nl/nl/onderzoek-resultaten/d...
5 U-TOPIA, Towards empowering older persons livi... ... https://www.zonmw.nl/nl/over-zonmw/e-health-en...
[6 rows x 6 columns]
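On the "which link is not working?" part of the question: one possible approach, sketched below, is to wrap the per-URL work in a try/except and record the URL whenever the AttributeError is raised. The names scrape_all and scrape_one are hypothetical and not from the original code; scrape_one stands for a function that holds the body of the loop (requests.get, BeautifulSoup, the find calls) and returns one row.

```python
def scrape_all(urls, scrape_one):
    """Run scrape_one(url) for every URL and keep going when a page fails.

    scrape_one is a hypothetical callable (an assumption of this sketch)
    that fetches and parses a single page and returns one row of data.
    Returns (rows, failed), where failed is a list of (url, error message).
    """
    rows, failed = [], []
    for url in urls:
        try:
            rows.append(scrape_one(url))
        except AttributeError as exc:
            # Raised when find() returned None and .text/.find() was called on it
            failed.append((url, str(exc)))
    return rows, failed
```

Printing failed after the run lists every URL whose page lacks an expected element, together with the error message, so the remaining pages are still scraped.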
CodePudding user response:
First check whether the element (tag/class) is available, and only then use .text. Also initialize the variables as empty inside the for loop. See the sample below for the validation.
for URL in URLs:
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, 'html.parser')
    # initialize the variables. They keep the default blank value if the element is unavailable.
    project_number = project_start = ''
    # the details block itself can be missing, so look it up before searching inside it
    project_details = soup.find(class_="block-details")
    # look up the elements first - [do not use .text directly on the result of find()]
    project_number_xp = project_details.find("strong") if project_details else None
    project_start_xp = project_details.find("span", class_="bar-start") if project_details else None
    # apply .text only if the element was found, otherwise it will throw an error.
    if project_number_xp:
        project_number = project_number_xp.text
    if project_start_xp:
        project_start = project_start_xp.text
    # Append the data
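The same check can be folded into a small helper so it does not have to be repeated for every field. This is only a sketch: find_text is a made-up name, and it assumes parent is whatever soup.find(...) returned, i.e. a tag or None.

```python
def find_text(parent, *args, default="", **kwargs):
    """Return the stripped text of parent.find(*args, **kwargs).

    Falls back to default when parent is None or when nothing matches,
    instead of raising AttributeError on None.
    """
    if parent is None:
        return default
    match = parent.find(*args, **kwargs)
    if match is None:
        return default
    return match.get_text(strip=True)
```

With it, the lookups could be written as project_number = find_text(project_details, "strong") and project_start = find_text(project_details, "span", class_="bar-start"), and a page without the block-details element gives blank values instead of an exception.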