Python - How to scrape Yelp reviews using Selenium?

I am working on a Python app that will help me get reviews for a particular restaurant. I am using Selenium 4.1 with Python.

After setting up the Selenium driver in my project folder, I put this code together based on the Selenium documentation:

# YELP REVIEW SCRAPER

# Importing dependencies
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.common.by import By
# Setting up driver options
options = webdriver.ChromeOptions()
# Setting up Path to chromedriver executable file
CHROMEDRIVER_PATH = '../Selenium/chromedriver.exe'
# Adding options
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)
# Setting up chrome service
service = ChromeService(executable_path=CHROMEDRIVER_PATH)
# Creating the Chrome web driver with the service and options set above
driver = webdriver.Chrome(service=service, options=options)

driver.get('https://www.yelp.com/biz/taste-of-texas-houston')

This successfully opens the Yelp page of the restaurant I want reviews for, but when I tried to scrape the reviews using:

driver.find_element(By.CLASS_NAME, ' raw__09f24__T4Ezm')

where ' raw__09f24__T4Ezm' is the class name of the span holding the first review, I get the error:

InvalidSelectorException: Message: invalid selector: An invalid or illegal selector was specified
  (Session info: chrome=96.0.4664.45)
Stacktrace:
Backtrace:
    Ordinal0 [0x00BD6903 2517251]
    Ordinal0 [0x00B6F8E1 2095329]
    Ordinal0 [0x00A72848 1058888]
    Ordinal0 [0x00A74F44 1068868]
    Ordinal0 [0x00A74E0E 1068558]
    Ordinal0 [0x00A75070 1069168]
    Ordinal0 [0x00A9D1C2 1233346]
    Ordinal0 [0x00A9D63B 1234491]
    Ordinal0 [0x00AC7812 1406994]
    Ordinal0 [0x00AB650A 1336586]
    Ordinal0 [0x00AC5BBF 1399743]
    Ordinal0 [0x00AB639B 1336219]
    Ordinal0 [0x00A927A7 1189799]
    Ordinal0 [0x00A93609 1193481]
    GetHandleVerifier [0x00D65904 1577972]
    GetHandleVerifier [0x00E10B97 2279047]
    GetHandleVerifier [0x00C66D09 534521]
    GetHandleVerifier [0x00C65DB9 530601]
    Ordinal0 [0x00B74FF9 2117625]
    Ordinal0 [0x00B798A8 2136232]
    Ordinal0 [0x00B799E2 2136546]
    Ordinal0 [0x00B83541 2176321]
    BaseThreadInitThunk [0x757C6739 25]
    RtlGetFullPathName_UEx [0x773B8AFF 1215]
    RtlGetFullPathName_UEx [0x773B8ACD 1165]

I tried researching this error but had no luck. How can I modify my code to get all available reviews for this restaurant, including each review's date, author, score, and text?

CodePudding user response:

I don't know how to parse the data with Selenium alone, since I use BeautifulSoup for that. Here is an example that loads the page with Selenium and parses it with BeautifulSoup:


from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Run Chrome without a visible window and silence driver logging
options = Options()
options.add_argument('--headless')
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(options=options)
driver.get('https://www.yelp.com/biz/taste-of-texas-houston')

# Hand the rendered HTML to BeautifulSoup, then close the browser
content = driver.page_source
driver.quit()
soup = BeautifulSoup(content, features="lxml")

# Select the <li> elements by class names copied from the page source
reviews = soup.find_all("li", attrs={'class': 'margin-b5__09f24__pTvws border-color--default__09f24__NPAKY'})

for review in reviews:
    print(review.text)

From there you can parse it again looking for the data you need.
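
As for the InvalidSelectorException in the question: it most likely comes from the leading space in ' raw__09f24__T4Ezm'. As far as I know, the Selenium 4 Python bindings turn a By.CLASS_NAME lookup into a CSS selector by putting a dot in front of the value, so the space produces '. raw__09f24__T4Ezm', which is not valid CSS. A value containing several space-separated classes would fail for a similar reason. Below is a minimal sketch of a workaround that stays in Selenium. The helper names class_attr_to_css and find_review_texts are made up for this example, and it assumes the class names contain only letters, digits, hyphens and underscores (anything else would need CSS escaping). Yelp's class names look auto-generated, so they may well have changed since the question was asked.

```python
def class_attr_to_css(class_attr):
    """Build a CSS selector from the value of an HTML class attribute.

    Hypothetical helper: strips stray whitespace and joins every class
    name with a leading dot, so ' a  b ' becomes '.a.b'.
    """
    names = class_attr.split()
    if not names:
        raise ValueError("class attribute is empty")
    return "".join("." + name for name in names)


def find_review_texts(driver, class_attr, timeout=10):
    """Return the text of every element carrying all the given classes.

    Hypothetical helper: 'driver' is an already created Selenium WebDriver
    that has loaded the page. Selenium is imported here so that
    class_attr_to_css can be used without Selenium installed.
    """
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    selector = class_attr_to_css(class_attr)
    # Wait until at least one matching element is present in the DOM
    elements = WebDriverWait(driver, timeout).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, selector))
    )
    return [element.text for element in elements]


# The value from the question, leading space included
print(class_attr_to_css(' raw__09f24__T4Ezm'))
```

With the driver from the question, find_review_texts(driver, ' raw__09f24__T4Ezm') should then return the review texts instead of raising, provided that class is still present on the page. Note that find_element returns only the first match, which is why the sketch collects all matching elements instead.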