I'm running into an issue when web scraping a large page. My scrape works fine for the first 30 href links, but then raises KeyError: 'href' roughly 25% of the way into the page contents.
The elements appear to be the same throughout the page, i.e. I can see no difference between the last scraped element and the next one, where the script stops. Is this caused by the driver not loading the entire page before the scrape runs, or by it only partially loading the page?
import re
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
from time import sleep
from random import randint
chromedriver_path = r"C:\Program Files (x86)\chromedriver.exe"  # raw string so the backslashes are not treated as escapes
service = Service(chromedriver_path)
options = Options()
# options.headless = True
options.add_argument("--incognito")
driver = webdriver.Chrome(service=service, options=options)
url = 'https://hackerone.com/bug-bounty-programs'
driver.get(url)
sleep(randint(15,20))
driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
soup = BeautifulSoup(driver.page_source,'html.parser')
# driver.quit()
links = soup.find_all("a")
for link in links:
    print(link['href'])
CodePudding user response:
There is no need for Selenium if you only want the bounty links. That seems more useful than grabbing every link on the page, and it also avoids the duplicates you get when scraping all links.
Simply use the query-string endpoint that returns the bounties as JSON. You can then update the urls to include the protocol and domain.
import requests
import pandas as pd
data = requests.get('https://hackerone.com/programs/search?query=bounties:yes&sort=name:ascending&limit=1000').json()
df = pd.DataFrame(data['results'])
df['url'] = 'https://hackerone.com' + df['url']
print(df.head())
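The URL-prefixing step can also be sketched without pandas or a network call. This is only an illustration: the sample records below are hypothetical stand-ins for the entries in the 'results' list, and the actual field names and path shapes returned by the endpoint may differ.

```python
from urllib.parse import urljoin

BASE_URL = "https://hackerone.com"

# Hypothetical sample shaped like entries of data['results'];
# the names and paths here are made up for illustration.
sample_results = [
    {"name": "Example A", "url": "/example_a"},
    {"name": "Example B", "url": "/example_b"},
]


def absolute_urls(results, base=BASE_URL):
    """Join each record's relative 'url' onto the base protocol and domain."""
    return [urljoin(base, item["url"]) for item in results]


program_urls = absolute_urls(sample_results)
print(program_urls)
```

Using urljoin rather than plain string concatenation means a value that is already an absolute URL is left as it is instead of being prefixed twice.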
CodePudding user response:
Maximize the browser window and remove the JavaScript execution and the random sleep. The following code works without any issues.
Working code:
import time
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.service import Service
webdriver_service = Service("./chromedriver") #Your chromedriver path
driver = webdriver.Chrome(service=webdriver_service)
url='https://hackerone.com/bug-bounty-programs'
driver.get(url)
driver.maximize_window()
time.sleep(5)
soup = BeautifulSoup(driver.page_source,"lxml")
links = soup.find_all("a")
for link in links:
    # .get returns None instead of raising KeyError when an <a> tag has no href
    href = link.get('href')
    if href and href.startswith("https:"):
        print(href)
Output:
https://www.hackerone.com/solutions/vulnerability-management-system
https://www.hackerone.com/solutions/cloud-security-solution
https://www.hackerone.com/solutions/application-security-testing-software
https://www.hackerone.com/contact
https://www.hackerone.com/security-incident
https://www.hackerone.com/product/overview
https://www.hackerone.com/product/bug-bounty-platform
https://www.hackerone.com/product/response-vulnerability-disclosure-program
https://www.hackerone.com/product/security-assessments
https://www.hackerone.com/product/insights
https://www.hackerone.com/services
https://www.hackerone.com/product/pentest
https://www.hackerone.com/contact
https://www.hackerone.com/security-incident
https://www.hackerone.com/partners
https://www.hackerone.com/partners/integrations
https://www.hackerone.com/partners/aws
https://www.hackerone.com/contact
https://www.hackerone.com/security-incident
https://www.hackerone.com/company
https://www.hackerone.com/leadership
https://www.hackerone.com/careers
https://www.hackerone.com/trust
https://www.hackerone.com/press
https://www.hackerone.com/press-archive
https://www.hackerone.com/contact
https://www.hackerone.com/security-incident
https://www.hackerone.com/hackers
https://www.hackerone.com/hackers/hacker101
https://www.hackerone.com/hacktivity
https://www.hackerone.com/leaderboard
https://www.hackerone.com/hacktivitycon
https://www.hackerone.com/contact
https://www.hackerone.com/security-incident
https://www.hackerone.com/resources
https://docs.hackerone.com/
https://www.hackerone.com/events
https://www.hackerone.com/security-at
https://www.hackerone.com/vulnerability-and-security-testing-blog
https://www.hackerone.com/blog/category/application-security
https://www.hackerone.com/blog/category/ethical-hacker
https://www.hackerone.com/blog/category/penetration-testing
https://www.hackerone.com/blog/category/security-compliance
https://www.hackerone.com/blog/category/vulnerability-management
https://www.hackerone.com/contact
https://www.hackerone.com/security-incident
https://www.hackerone.com/contact
https://www.hackerone.com/security-incident
https://www.hackerone.com/security-incident
https://www.hackerone.com/contact
https://www.hackerone.com/solutions/vulnerability-management-system
https://www.hackerone.com/solutions/cloud-security-solution
https://www.hackerone.com/solutions/application-security-testing-software
https://www.hackerone.com/product/overview
https://www.hackerone.com/product/bug-bounty-platform
https://www.hackerone.com/product/response-vulnerability-disclosure-program
https://www.hackerone.com/product/security-assessments
https://www.hackerone.com/product/insights
https://www.hackerone.com/services
https://www.hackerone.com/product/pentest
https://www.hackerone.com/partners
https://www.hackerone.com/partners/integrations
https://www.hackerone.com/partners/aws
https://www.hackerone.com/company
https://www.hackerone.com/leadership
https://www.hackerone.com/careers
https://www.hackerone.com/trust
https://www.hackerone.com/press
https://www.hackerone.com/press-archive
https://www.hackerone.com/hackers
https://www.hackerone.com/hackers/hacker101
https://www.hackerone.com/hacktivity
https://www.hackerone.com/leaderboard
https://www.hackerone.com/hacktivitycon
https://www.hackerone.com/resources
https://docs.hackerone.com/
https://www.hackerone.com/events
https://www.hackerone.com/security-at
https://www.hackerone.com/vulnerability-and-security-testing-blog
https://www.hackerone.com/blog/category/application-security
https://www.hackerone.com/blog/category/ethical-hacker
https://www.hackerone.com/blog/category/penetration-testing
https://www.hackerone.com/blog/category/security-compliance
https://www.hackerone.com/blog/category/vulnerability-management
https://www.hackerone.com
https://www.hackerone.com/product/challenge
https://www.hackerone.com/product/response
https://www.hackerone.com/resources/hacker-powered-security-report
https://www.hackerone.com/resources/responsible-disclosure-overview
https://www.hackerone.com/product/overview
https://www.hackerone.com/blog
https://docs.hackerone.com
https://support.hackerone.com/hc/en-us/requests/new
https://www.hackerone.com/disclosure-guidelines
https://www.hackerone.com/press
https://www.hackerone.com/privacy
https://www.hackerone.com/terms
https://twitter.com/hacker0x01
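On the KeyError from the question: one likely cause, though this is an assumption because the failing element was not shown, is an <a> tag with no href attribute rather than a page that only partially loaded. Subscripting a BeautifulSoup tag with a missing attribute raises KeyError. The sketch below reproduces that on a small hypothetical HTML snippet and shows two common ways to avoid it.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second anchor has no href attribute.
html = """
<a href="https://example.com/one">one</a>
<a name="anchor-only">no href here</a>
<a href="/two">two</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Subscripting a tag raises KeyError when the attribute is absent.
try:
    for link in soup.find_all("a"):
        link["href"]
    raised = False
except KeyError:
    raised = True

# Option 1: only match anchors that actually carry an href attribute.
hrefs_filtered = [a["href"] for a in soup.find_all("a", href=True)]

# Option 2: Tag.get returns None for a missing attribute, so skip those.
hrefs_get = [a.get("href") for a in soup.find_all("a") if a.get("href")]

print(raised, hrefs_filtered, hrefs_get)
```

Either option lets the loop run over every anchor on the page; if links are still missing after that, page loading would be the next thing to check.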