Loading web page using headless Chrome and Selenium returns Debugging Information, IP Address, Ray ID

I am trying to create an application that scrapes certain e-commerce websites. I am using Selenium for this and deploying the application on an EC2 instance running CentOS. I developed the code locally, where it worked, but it fails on the remote machine.

The code that I am using

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

ser = Service(ChromeDriverManager().install())
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")

selenium_driver = webdriver.Chrome(service=ser, options=chrome_options)

url = 'https://www.everlane.com/products/womens-cloud-cable-knit-vest-oatmeal?collection=womens-newest-arrivals'

selenium_driver.get(url)

title = selenium_driver.find_element(By.XPATH, '//*[@id="content"]/div/div[3]/div[2]/div/div/div/div[2]/div/div[1]/hgroup/h1/span')
print(title.text)

When I run this code on the remote machine, I get an error with the following stack trace:

Traceback (most recent call last):
  File "/home/ec2-user/.local/lib/python3.7/site-packages/flask/app.py", line 2091, in __call__
    return self.wsgi_app(environ, start_response)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/flask/app.py", line 2076, in wsgi_app
    response = self.handle_exception(e)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/flask/app.py", line 2073, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/ec2-user/.local/lib/python3.7/site-packages/flask/app.py", line 1518, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/flask/app.py", line 1516, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/ec2-user/.local/lib/python3.7/site-packages/flask/app.py", line 1502, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/home/ec2-user/price_tracker/flask_api.py", line 22, in home
    title, price, isSizeAvailable, shop = prices.checkInfoByShop(url, size)
  File "/home/ec2-user/price_tracker/check_prices.py", line 132, in checkInfoByShop
    secondaryPriceXPath=secondaryPriceXPath)
  File "/home/ec2-user/price_tracker/check_prices.py", line 61, in checkSelenium
    title = self.selenium_driver.find_element(By.XPATH, titleXPath)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 1246, in find_element
    'value': value})['value']
  File "/home/ec2-user/.local/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 424, in execute
    self.error_handler.check_response(response)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/selenium/webdriver/remote/errorhandler.py", line 247, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//*[@id="content"]/div/div[3]/div[2]/div/div/div/div[2]/div/div[1]/hgroup/h1/span"}
  (Session info: headless chrome=96.0.4664.110)
Stacktrace:
#0 0x559979e8dee3 <unknown>
#1 0x55997995b608 <unknown>
#2 0x559979991aa1 <unknown>
#3 0x559979991c61 <unknown>
#4 0x5599799c4714 <unknown>
#5 0x5599799af29d <unknown>
#6 0x5599799c23bc <unknown>
#7 0x5599799af163 <unknown>
#8 0x559979984bfc <unknown>
#9 0x559979985c05 <unknown>
#10 0x559979ebfbaa <unknown>
#11 0x559979ed5651 <unknown>
#12 0x559979ec0b05 <unknown>
#13 0x559979ed6a68 <unknown>
#14 0x559979eb505f <unknown>
#15 0x559979ef1818 <unknown>
#16 0x559979ef1998 <unknown>
#17 0x559979f0ceed <unknown>
#18 0x7ff5dd53b40b <unknown>

For debugging purposes, I tried to read the entire body of the webpage using

body = selenium_driver.find_element(By.XPATH, '/html/body')
print(body.text)

which returns

"We're sorry, something has gone wrong. Please try again.\nIf you continue to have trouble, please contact us at [email protected].\nChecking your browser before accessing www.everlane.com.\nThis process is automatic. Your browser will redirect to your requested content shortly.\nPlease allow up to 5 seconds…\nDebugging Information\nIP Address\n<ip-address>\nRay ID\n6c57184d797805a0"

I understand that my request might be getting blocked for some reason but is there a way to bypass this?

I have tried adding wait statements in the hope of landing on the redirect but nothing has worked so far.

CodePudding user response：

That message suggests the page content is different from what you expect, so your code is working as intended. I'd have Selenium wait for the element to be visible. If you don't want to do that, you can also wait for the page to redirect.
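As a rough sketch of the "wait for the redirect" idea, the helper below polls the page text until a marker phrase is gone. The function name `wait_for_text_to_disappear` and its parameters are hypothetical, and it is written without Selenium so the polling logic is easy to follow; Selenium's own `WebDriverWait(...).until(...)` with a lambda does the same job. It only helps if the interstitial actually clears, which the question reports it did not.

```python
import time
from typing import Callable


def wait_for_text_to_disappear(
    get_body_text: Callable[[], str],
    marker: str,
    timeout: float = 10.0,
    poll_interval: float = 0.5,
) -> bool:
    """Poll until `marker` is no longer in the page text.

    Returns True if the marker disappeared before the timeout, else False.
    """
    deadline = time.monotonic() + timeout
    while True:
        if marker not in get_body_text():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


# Simulated page that clears on the third poll (stand-in for a real browser).
pages = iter([
    "Checking your browser",
    "Checking your browser",
    "The Cloud Cable-Knit Vest",
])
print(wait_for_text_to_disappear(
    lambda: next(pages), "Checking your browser", poll_interval=0.01
))  # True
```

With a real driver, `get_body_text` could be something like `lambda: selenium_driver.find_element(By.TAG_NAME, "body").text`.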

CodePudding user response：

Because of the message

Checking your browser before accessing www.everlane.com.\nThis process is automatic. Your browser will redirect to your requested content shortly.

it seems this site has Cloudflare protection enabled.

See the reference: https://thegeekpage.com/how-to-fix-checking-your-browser-before-accessing-message/
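To make this diagnosis explicit in code, a scraper could check the page text for challenge-page phrases before looking for product elements. This is a minimal sketch: the helper name `looks_like_challenge_page` and the marker list are assumptions based only on the body text quoted in the question, and other challenge pages may use different wording.

```python
# Phrases taken from the body text quoted in the question; the list is an
# assumption and may not match every challenge page.
CHALLENGE_MARKERS = (
    "Checking your browser before accessing",
    "Ray ID",
)


def looks_like_challenge_page(body_text: str) -> bool:
    """Return True if the text contains any known challenge-page phrase."""
    return any(marker in body_text for marker in CHALLENGE_MARKERS)


blocked = (
    "Checking your browser before accessing www.everlane.com.\n"
    "This process is automatic."
)
print(looks_like_challenge_page(blocked))                      # True
print(looks_like_challenge_page("The Cloud Cable-Knit Vest"))  # False
```

Failing early with a clear "blocked by challenge page" message is easier to debug than a NoSuchElementException raised from a deep XPath.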

I suggest trying selenium-stealth:

https://pypi.org/project/selenium-stealth/

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium_stealth import stealth

ser = Service(ChromeDriverManager().install())
options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(service=ser, options=options)

stealth(driver,
        languages=["en-US", "en"],
        vendor="Google Inc.",
        platform="Win32",
        webgl_vendor="Intel Inc.",
        renderer="Intel Iris OpenGL Engine",
        fix_hairline=True,
        )

url = 'https://www.everlane.com/products/womens-cloud-cable-knit-vest-oatmeal?collection=womens-newest-arrivals'

driver.get(url)
title = driver.find_element(By.XPATH, '//*[@id="content"]/div/div[3]/div[2]/div/div/div/div[2]/div/div[1]/hgroup/h1/span')
print(title.text)

Some of the repositories under this GitHub topic might also be helpful:

https://github.com/topics/cloudflare-bypass

CodePudding user response：

I'd suggest using webdriver waits to wait for the page to load.

# 10 is an assumed timeout in seconds; adjust as needed
wait = WebDriverWait(selenium_driver, 10)
elem = wait.until(EC.visibility_of_element_located((By.XPATH, '//*[@id="content"]/div/div[3]/div[2]/div/div/div/div[2]/div/div[1]/hgroup/h1/span')))
print(elem.text)

Imports:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC

CodePudding user response：

This message...

"We're sorry, something has gone wrong. Please try again.\nIf you continue to have trouble, please contact us at [email protected].\nChecking your browser before accessing www.everlane.com.\nThis process is automatic. Your browser will redirect to your requested content shortly.\nPlease allow up to 5 seconds…\nDebugging Information\nIP Address\n<ip-address>\nRay ID\n6c57184d797805a0"

...implies that the Chrome browsing context initiated by the Selenium-driven ChromeDriver was detected as a bot.

However, I was able to bypass the detection in headless Chrome using a few arguments, as follows:

Code Block:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

options = Options()
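# The options.headless setter was removed in newer Selenium releases; the argument form works across versions
options.add_argument("--headless")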
options.add_argument("start-maximized")
# A second "excludeSwitches" call would overwrite the first, so pass both switches together
options.add_experimental_option("excludeSwitches", ["enable-automation", "enable-logging"])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument('--disable-blink-features=AutomationControlled')
s = Service('C:\\BrowserDrivers\\chromedriver.exe')
driver = webdriver.Chrome(service=s, options=options)
driver.get("https://www.everlane.com/products/womens-cloud-cable-knit-vest-oatmeal?collection=womens-newest-arrivals")
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//h1[@class='product-heading__name']/span"))).text)
driver.quit()

Console Output:
```
The Cloud Cable-Knit Vest
```
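One quick way to check whether options like these took effect is to ask the page what it sees in `navigator.webdriver`, a property browsers set when they are under WebDriver control. The sketch below assumes a Selenium driver object and uses its `execute_script` method; the helper name `reports_webdriver` is hypothetical. It checks only this one signal, and bot-detection services may rely on others.

```python
def reports_webdriver(driver) -> bool:
    """Return True if the page's JavaScript sees navigator.webdriver as truthy.

    `driver` is expected to be a Selenium WebDriver (anything exposing
    execute_script works). A missing or false property yields False.
    """
    return bool(driver.execute_script("return navigator.webdriver"))
```

Calling `reports_webdriver(driver)` right after `driver.get(...)` and before `driver.quit()` shows whether the page can still see the automation flag.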