Scraping a site using selenium and bs4 does not work

Time:12-13

I am trying to scrape the following site:

https://cve.mitre.org/cve/data_feeds

    driver = webdriver.Chrome()  # brew install chromedriver
    driver.get(self._SCRAPE_WEBSITE_URL)
    page = driver.page_source
    soup = BeautifulSoup(page, 'lxml')
    cve = soup.find_all("li", {"class": "timeline-TweetList-tweet customisable-border"})
    print(cve)

But my print returns an empty list.

Any ideas?

CodePudding user response:

The elements you are trying to access are inside an iframe.
To access them, you first have to switch to that iframe.
With Selenium this can be done as follows:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # brew install chromedriver
wait = WebDriverWait(driver, 20)

driver.get(self._SCRAPE_WEBSITE_URL)
# Wait for the Twitter widget iframe to load, then switch into it
wait.until(EC.frame_to_be_available_and_switch_to_it((By.XPATH, "//iframe[@id='twitter-widget-0']")))
# Wait until at least one tweet is visible before collecting them all
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "li.timeline-TweetList-tweet.customisable-border")))
cve = driver.find_elements(By.CSS_SELECTOR, "li.timeline-TweetList-tweet.customisable-border")

I guess this can also be done with bs4; however, I'm not familiar enough with bs4 to know how to switch into an iframe with it.
Also, don't forget to switch back to the default content when you have finished dealing with the iframe content.
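
To make the switch back hard to forget, the two calls can be wrapped in a small context manager. This is only a sketch: inside_frame is a hypothetical helper name, not part of Selenium, and it assumes nothing beyond the driver's switch_to.frame() and switch_to.default_content() methods.

```python
from contextlib import contextmanager


@contextmanager
def inside_frame(driver, frame):
    """Switch into `frame`, then always return to the top-level document.

    `frame` can be anything switch_to.frame() accepts: an id or name,
    an index, or an iframe WebElement. (Hypothetical helper.)
    """
    driver.switch_to.frame(frame)
    try:
        yield driver
    finally:
        # Runs even if the body raises, so the driver is never left
        # pointing at the iframe by accident.
        driver.switch_to.default_content()


# Usage with a real Selenium driver (not executed here):
#
# with inside_frame(driver, "twitter-widget-0"):
#     cve = driver.find_elements(By.CSS_SELECTOR, "li.timeline-TweetList-tweet")
# # Back in the main document at this point.
```

Note that this switches immediately, so it does not replace the explicit wait shown above when the iframe loads late.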

CodePudding user response:

Your print returns an empty list because the HTML you are looking for sits inside an iframe (the second one on the page), so you need to switch to that iframe before you can get the data. After switching, it works fine.

You can install the driver manager with pip install webdriver-manager and then run this code:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import time


url = 'https://cve.mitre.org/cve/data_feeds'

# Selenium 4 takes the driver path through a Service object
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)

driver.maximize_window()
driver.get(url)
time.sleep(5)  # give the embedded Twitter widget time to load

# The tweets live in the second iframe on the page
iframe = driver.find_elements(By.TAG_NAME, 'iframe')[1]
driver.switch_to.frame(iframe)

# page_source now returns the HTML of the current frame
soup = BeautifulSoup(driver.page_source, 'html.parser')
cves = soup.find_all("li", {"class": "timeline-TweetList-tweet customisable-border"})
for cve in cves:
    tweet_text = cve.select_one('p').text
    print(tweet_text)

Result:

CVE-2021-44833 The CLI 1.0.0 for Amazon AWS OpenSearch has weak permissions for the configuration file. https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-44833 …

CVE-2021-41805 HashiCorp Consul Enterprise before 1.8.17, 1.9.x before 1.9.11, and 1.10.x before 1.10.4 has Incorrect Access Control. An ACL token (with the default operator:write permissions) in one namespace can be used for unintended privilege ... https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-41805 …

CVE-2021-44515 Zoho ManageEngine Desktop Central is vulnerable to authentication bypass, leading to remote code execution on the server, as exploited in the wild in December 2021. For Enterprise builds 10.1.2127.17 and earlier, upgrade to 10.1.212... https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-44515 …

CVE-2021-4097 phpservermon is vulnerable to Improper Neutralization of CRLF Sequences https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-4097 …

CVE-2021-4092 yetiforcecrm is vulnerable to Cross-Site Request Forgery (CSRF) https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-4092 …

CVE-2021-41242 OpenOlat is a web-basedlearning management system. A path traversal vulnerability exists in OpenOlat prior to versions 15.5.12 and 16.0.5. By providing a filename that contains a relative path as a parameter in some REST methods, it... https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-41242 …

CVE-2021-26340 A malicious hypervisor in conjunction with an unprivileged attacker process inside an SEV/SEV-ES guest VM may fail to flush the Translation Lookaside Buffer (TLB) resulting in unexpected behavior inside the virtual machine (VM). https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-26340 …

CVE-2020-12890 Improper handling of pointers in the System Management Mode (SMM) handling code may allow for a privileged attacker with physical or administrative access to potentially manipulate the AMD Generic Encapsulated Software Architecture ... https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-12890 …

CVE-2021-43815 Grafana is an open-source platform for monitoring and observability. Grafana prior to versions 8.3.2 and 7.5.12 has a directory traversal for arbitrary .csv files. It only affects in... https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-43815 …

CVE-2021-4089 snipe-it is vulnerable to Improper Access Control https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-4089 …

CVE-2021-23700 All versions of package merge-deep2 are vulnerable to Prototype Pollution via the mergeDeep() function. https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-23700 …

CVE-2021-23663 All versions of package sey are vulnerable to Prototype Pollution via the deepmerge() function. https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-23663 …

CVE-2021-23639 The package md-to-pdf before 5.0.0 are vulnerable to Remote Code Execution (RCE) due to utilizing the library gray-matter to parse front matter content, without disabling the JS engine. https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-23639 …

CVE-2021-23561 All versions of package comb are vulnerable to Prototype Pollution via the deepMerge() function. https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-23561 …

CVE-2021-23463 The package com.h2database:h2 from 0 and before 2.0.202 are vulnerable to XML External Entity (XXE) Injection via the org.h2.jdbc.JdbcSQLXML class object, when it receives parsed str... https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-23463 …

CVE-2021-27984 In Pluck-4.7.15 admin background a remote command execution vulnerability exists when uploading files. https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-27984 …
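
The code in this answer picks the iframe by position ([1]), which may break if the page adds or reorders its frames. One possible alternative, sketched below, is to try each top-level iframe and stay in the first one that contains the elements you want. switch_to_frame_containing is a hypothetical helper name; the strings "tag name" and "css selector" are the values behind Selenium's By.TAG_NAME and By.CSS_SELECTOR, used here so the sketch has no imports. It assumes the frames are already loaded and does not look into nested iframes.

```python
def switch_to_frame_containing(driver, css_selector):
    """Switch into the first top-level iframe that has a match.

    Returns True and leaves the driver inside that iframe, or returns
    False and leaves the driver on the top-level document.
    (Hypothetical helper, not part of Selenium.)
    """
    driver.switch_to.default_content()
    # "tag name" is the string value of By.TAG_NAME
    frames = driver.find_elements("tag name", "iframe")
    for frame in frames:
        # Start from the top-level document each time, since the
        # iframe elements were located there.
        driver.switch_to.default_content()
        driver.switch_to.frame(frame)
        # "css selector" is the string value of By.CSS_SELECTOR
        if driver.find_elements("css selector", css_selector):
            return True
    driver.switch_to.default_content()
    return False


# Usage with a real Selenium driver (not executed here):
#
# if switch_to_frame_containing(driver, "li.timeline-TweetList-tweet"):
#     soup = BeautifulSoup(driver.page_source, "html.parser")
```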