I'm trying to scrape user comments (see disclaimer below). The comments are split across numbered pages.
I'm locating the numbered pagination elements and just clicking the next button (>). The page does change, but the new data does not populate.
Here is a short excerpt of the code:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
DRIVER_PATH = '***/chromedriver.exe'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)  # executable_path is deprecated, needs updating
URL = "https://www.kbb.com/mercedes-benz/cla/2018/consumer-reviews/"
driver.get(URL)
time.sleep(5)
button = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, '//button[@]')))
button.click()
WebDriverWait(driver, 50)
# driver.close()
What can I do to make the fields reload properly? I appreciate all the info I can get :-)
Disclaimer: This is a first test for a research project; there will be no illegal scraping without permission and no misuse of data!
CodePudding user response:
The page data is rendered dynamically. You can get the data through the API and iterate over the page parameter. You can also just increase the number of reviews per page and get everything in a single request (provided there are 100 or fewer reviews).
import requests
import pandas as pd
url = 'https://www.kbb.com/ymm/api/'
payload = {
"operationName":"consumerReviewsQuery",
"variables":{
"year":"2018",
"make":"mercedes-benz",
"model":"cla",
"page":1,
"perPage":100,
"bodystyle":"Sedan",
"sort":"1",
"filter":"",
"trendingTopic":""
},
"query":"query consumerReviewsQuery($year: String, $make: String!, $model: String!, $page: Int!, $perPage: Int!, $isInitialLoad: Boolean, $priceType: String, $bodystyle: String, $vehicleId: String, $trim: String, $sort: String, $trendingTopic: String, $filter: String) {\n consumerreviews(\n year: $year\n make: $make\n model: $model\n page: $page\n perPage: $perPage\n isInitialLoad: $isInitialLoad\n priceType: $priceType\n bodystyle: $bodystyle\n vehicleId: $vehicleId\n trim: $trim\n sort: $sort\n trendingTopic: $trendingTopic\n filter: $filter\n ) {\n numPages\n totalReviews\n reviews {\n id\n nickname\n nicknameDisplay\n location\n anonymous\n email\n sessionId\n visitorId\n sessionCount\n friendlyOwnershipStatus\n year\n model\n make\n vehicleId\n title\n reviewText\n ratingOverall\n ratingValue\n ratingReliability\n ratingPerformance\n ratingStyling\n ratingComfort\n ratingQuality\n submissionDate\n positiveLink\n negativeLink\n numPositiveFeedbacks\n numNegativeFeedbacks\n numFeedbacks\n pros\n cons\n areProsOrConsAvailable\n __typename\n }\n searchTerms\n __typename\n }\n}"}
jsonData = requests.post(url, json=payload).json()
reviews = pd.DataFrame(jsonData['data']['consumerreviews']['reviews'])
Output:
print(reviews)
id nickname ... areProsOrConsAvailable __typename
0 187159459 Love it ... True Reviews
1 179266834 Cremur ... True Reviews
2 176067479 ELSIE ... False Reviews
3 172175820 Noemia ... True Reviews
4 163968274 Pmaze ... True Reviews
5 158405420 Gary ... True Reviews
6 143025966 PMAZE ... True Reviews
7 139966209 Frenchy ... True Reviews
8 139766083 Arizona RN ... True Reviews
9 131870778 GW ... True Reviews
10 120024401 Deekay ... True Reviews
11 119822871 Tony ... True Reviews
12 116958004 MBPDX ... True Reviews
13 115487407 Smitty96 ... True Reviews
14 110965961 chhappy7 ... True Reviews
15 109184667 Tampafun ... True Reviews
16 101289834 Neile ... True Reviews
17 84350718 George ... True Reviews
18 75845132 dav ... True Reviews
19 72639833 Doug ... True Reviews
20 69174734 Carnut ... True Reviews
21 67191860 Mark ... True Reviews
22 65876085 bill ... False Reviews
23 64211472 Lazlow ... True Reviews
24 64008710 psyco ... True Reviews
25 57576670 vars0153 ... False Reviews
26 57574924 Fernando ... False Reviews
27 50932030 anauditor ... True Reviews
28 50346331 Missct1964 ... False Reviews
29 48468674 tekfoc ... True Reviews
30 48003934 BrwnJewel ... False Reviews
31 47955889 Free88 ... True Reviews
32 47726965 Josh ... True Reviews
33 47503009 Derek ... True Reviews
34 44513353 Don Z ... True Reviews
35 43143964 Raquel ... True Reviews
36 43142690 Pajama168 ... True Reviews
37 40484198 JJ ... True Reviews
38 39226477 fox4gib ... True Reviews
39 38915453 Happy in Chicago ... True Reviews
40 38485354 CLA owner ... True Reviews
41 35530044 1st time MB owner ... True Reviews
42 34931432 CC ... True Reviews
43 34151324 First time MB buyer ... True Reviews
44 33259903 tom ... True Reviews
45 32943654 Yash ... True Reviews
46 32472645 TheMarcoIslander ... True Reviews
[47 rows x 33 columns]
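If there are more reviews than fit in one request, the other option mentioned above (iterating over the page parameter) could look roughly like the sketch below. It assumes the numPages field returned by the query is the total number of pages for the current perPage value. The helper names collect_reviews and make_fetcher are made up for this illustration; url and payload are the ones defined above.

```python
import copy
from typing import Callable, Dict, List


def collect_reviews(fetch_page: Callable[[int], Dict]) -> List[Dict]:
    """Gather the reviews from every page.

    fetch_page(page) must return the decoded JSON response for that page.
    """
    reviews: List[Dict] = []
    page = 1
    while True:
        result = fetch_page(page)["data"]["consumerreviews"]
        reviews.extend(result["reviews"])
        # Assumption: numPages is the total page count for this query.
        if page >= result["numPages"]:
            break
        page += 1
    return reviews


def make_fetcher(url: str, payload: Dict) -> Callable[[int], Dict]:
    """Hypothetical helper: build a fetch_page function around requests.post."""
    import requests  # imported here so collect_reviews can be tried offline

    def fetch_page(page: int) -> Dict:
        body = copy.deepcopy(payload)      # leave the original payload untouched
        body["variables"]["page"] = page   # only the page number changes
        response = requests.post(url, json=body, timeout=30)
        response.raise_for_status()
        return response.json()

    return fetch_page
```

With the url and payload from above, this would be called as collect_reviews(make_fetcher(url, payload)), and the resulting list can be passed to pd.DataFrame in the same way as before.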
CodePudding user response:
I don't see any major issue in your code block. However, class names like ehp7fkv0 are dynamic and are bound to change every time you load the web application afresh. A canonical approach is to avoid the dynamic values and fall back on static attribute values.
To click the element, use WebDriverWait with element_to_be_clickable() and either of the following locator strategies:
Using CSS_SELECTOR:
driver.get('https://www.kbb.com/mercedes-benz/cla/2018/consumer-reviews/')
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[aria-label='go to previouse page']"))).click()
Using XPATH:
driver.get('https://www.kbb.com/mercedes-benz/cla/2018/consumer-reviews/')
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//button[@aria-label='go to previouse page']"))).click()
Note: You have to add the following imports :
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
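Since the question reports that the page changes but the reviews do not repopulate, it may also help to wait after the click until the visible content actually differs from what was there before, rather than sleeping for a fixed time. Below is a minimal, driver-agnostic sketch of that idea. The helper name wait_for_change is made up for this illustration, and nothing here has been verified against the site.

```python
import time
from typing import Callable


def wait_for_change(read_value: Callable[[], str], old_value: str,
                    timeout: float = 10.0, poll: float = 0.25) -> str:
    """Poll read_value() until it returns something other than old_value."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            value = read_value()
        except Exception:
            # The element may be missing or stale while the page re-renders;
            # treat that as "not changed yet" and keep polling.
            value = old_value
        if value != old_value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError("content did not change within the timeout")
        time.sleep(poll)


# Possible use with Selenium (the review selector is a placeholder, not taken from the site):
#   first_review = lambda: driver.find_element(By.CSS_SELECTOR, "<review selector>").text
#   before = first_review()
#   button.click()
#   wait_for_change(first_review, before)
```

If the text never changes, the TimeoutError at least makes it clear that the click did not trigger a reload, which narrows down where the problem is.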