I am working on a Python app that will help me get reviews for a particular restaurant. I am using Selenium 4.1 with Python as the web scraper.
After setting up the Selenium driver in my project folder, I put this code together based on the Selenium documentation:
#YELP REVIEW SCRAPER #
#Importing Dependencies
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.common.by import By
# Setting up driver options
options = webdriver.ChromeOptions()
# Setting up Path to chromedriver executable file
CHROMEDRIVER_PATH = '../Selenium/chromedriver.exe'
# Adding options
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)
# Setting up chrome service
service = ChromeService(executable_path=CHROMEDRIVER_PATH)
# Establishing Chrome web driver using the service and options set above
driver = webdriver.Chrome(service=service, options=options)
driver.get('https://www.yelp.com/biz/taste-of-texas-houston')
This successfully opens the Yelp page of the restaurant I want reviews for, but when I tried to scrape the reviews using:
driver.find_element(By.CLASS_NAME, ' raw__09f24__T4Ezm')
where ' raw__09f24__T4Ezm' is the class name of the span containing the first review, I get the error:
InvalidSelectorException: Message: invalid selector: An invalid or illegal selector was specified
(Session info: chrome=96.0.4664.45)
Stacktrace:
Backtrace:
Ordinal0 [0x00BD6903 2517251]
Ordinal0 [0x00B6F8E1 2095329]
Ordinal0 [0x00A72848 1058888]
Ordinal0 [0x00A74F44 1068868]
Ordinal0 [0x00A74E0E 1068558]
Ordinal0 [0x00A75070 1069168]
Ordinal0 [0x00A9D1C2 1233346]
Ordinal0 [0x00A9D63B 1234491]
Ordinal0 [0x00AC7812 1406994]
Ordinal0 [0x00AB650A 1336586]
Ordinal0 [0x00AC5BBF 1399743]
Ordinal0 [0x00AB639B 1336219]
Ordinal0 [0x00A927A7 1189799]
Ordinal0 [0x00A93609 1193481]
GetHandleVerifier [0x00D65904 1577972]
GetHandleVerifier [0x00E10B97 2279047]
GetHandleVerifier [0x00C66D09 534521]
GetHandleVerifier [0x00C65DB9 530601]
Ordinal0 [0x00B74FF9 2117625]
Ordinal0 [0x00B798A8 2136232]
Ordinal0 [0x00B799E2 2136546]
Ordinal0 [0x00B83541 2176321]
BaseThreadInitThunk [0x757C6739 25]
RtlGetFullPathName_UEx [0x773B8AFF 1215]
RtlGetFullPathName_UEx [0x773B8ACD 1165]
I tried researching this error but had no luck. Any idea how to modify my code so that I can get all available reviews for this restaurant, including the date of each review, the reviewer, the score, and the review text?
Answer:
I don't know how to parse data with Selenium itself, since I use BeautifulSoup for that. Here is an example with BeautifulSoup:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
# Run Chrome without a visible window and silence driver logging
options = Options()
options.add_argument('--headless')
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(options=options)
driver.get('https://www.yelp.com/biz/taste-of-texas-houston')
# Hand the rendered HTML to BeautifulSoup (the "lxml" parser must be installed)
content = driver.page_source
driver.quit()
soup = BeautifulSoup(content, features="lxml")
# Select the <li> elements with these classes (generated names that may change over time)
items = soup.find_all("li", attrs={'class': 'margin-b5__09f24__pTvws border-color--default__09f24__NPAKY'})
for item in items:
    print(item.text)
From there you can parse each item again to extract the data you need.
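As for the InvalidSelectorException itself: the value passed to By.CLASS_NAME in the question starts with a space. As far as I can tell, Selenium 4's Python bindings turn a By.CLASS_NAME lookup into a CSS selector by putting a dot in front of the value, so ' raw__09f24__T4Ezm' ends up as '. raw__09f24__T4Ezm', which is not valid CSS. By.CLASS_NAME also expects a single class name, so a copied class attribute with several space-separated names fails in a similar way. Below is a minimal sketch of a helper that cleans up a copied class attribute; css_from_class_attr is a hypothetical name of my own, not part of Selenium, and it assumes the class names contain no characters that need CSS escaping.

```python
def css_from_class_attr(class_attr: str) -> str:
    """Build a CSS selector from the value of an HTML class attribute.

    Hypothetical helper, not a Selenium API. Surrounding whitespace is
    dropped and every class name gets a leading dot, so the selector
    matches elements that carry all of the listed classes.
    """
    names = class_attr.split()  # split() also discards leading/trailing spaces
    if not names:
        raise ValueError("class attribute contains no class names")
    return "".join("." + name for name in names)


# The value from the question, including its leading space
print(css_from_class_attr(" raw__09f24__T4Ezm"))
```

With the driver from the question, the result can then be used as driver.find_elements(By.CSS_SELECTOR, css_from_class_attr(' raw__09f24__T4Ezm')). Using find_elements rather than find_element returns every matching element instead of only the first one. The generated class names on the page may change over time, so the selector may need updating.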