I am running a Python script that scrapes a website. The site uses Imperva to detect automated scripts crawling through its web pages, and Imperva blocks my IP from accessing the site as soon as I run the script. I read a suggestion to include time.sleep(random.randint(a, b)) in the script to mimic human behaviour, but it didn't work, or perhaps it just doesn't work as a standalone measure. If it's the Chrome driver itself that they detect, then I guess it would be impossible to avoid. Does anyone have any practical suggestions for things I could include in my script to bypass this? Thanks in advance.
CodePudding user response:
Introduction
Several different components need to be added to a web scraper to make it hard to detect. I recommend loading the page below to test your current level of detection:
driver.get("https://bot.sannysoft.com/")
More than likely, you will fail most of those tests right off the bat. Fortunately, it's easy to configure a scraper that passes them, which makes it much harder to detect.
Selenium-Stealth
selenium-stealth is a Python package that is used to avoid detection. Install it with:
pip install selenium-stealth
and then apply the configuration below:
from selenium_stealth import stealth

# Patch the running driver so it reports ordinary desktop Chrome properties.
stealth(driver,
    user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.53 Safari/537.36',
    languages=["en-US", "en"],
    vendor="Google Inc.",
    platform="Win32",
    webgl_vendor="Intel Inc.",
    renderer="Intel Iris OpenGL Engine",
    fix_hairline=True,
)
Your web scraper should now pass all of the tests. Next, try this configuration against the Imperva-protected site.
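For context, here is a rough sketch of how the pieces above might fit together in one place. It assumes Selenium 4 with a Chrome driver that Selenium can locate on its own. The names STEALTH_KWARGS, TEST_URL and make_stealth_driver are made up for this example. The Chrome options follow the usual selenium-stealth setup and are not something the steps above require.

```python
# Keyword arguments for selenium_stealth.stealth(), taken from the configuration above.
STEALTH_KWARGS = {
    "user_agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/83.0.4103.53 Safari/537.36"
    ),
    "languages": ["en-US", "en"],
    "vendor": "Google Inc.",
    "platform": "Win32",
    "webgl_vendor": "Intel Inc.",
    "renderer": "Intel Iris OpenGL Engine",
    "fix_hairline": True,
}

# Detection test page mentioned in the introduction.
TEST_URL = "https://bot.sannysoft.com/"


def make_stealth_driver():
    """Start Chrome with the automation switches removed, then apply selenium-stealth."""
    # Imported inside the function so this module loads even where Selenium is absent.
    from selenium import webdriver
    from selenium_stealth import stealth

    options = webdriver.ChromeOptions()
    options.add_argument("start-maximized")
    # Drop the "Chrome is being controlled by automated software" switch and extension.
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option("useAutomationExtension", False)

    # Assumption: Selenium can find a matching chromedriver by itself.
    driver = webdriver.Chrome(options=options)
    stealth(driver, **STEALTH_KWARGS)
    return driver
```

To try it, call driver = make_stealth_driver(), then driver.get(TEST_URL), check the results table in the browser window, and finish with driver.quit().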
More information
If you are still getting blocked, I recommend looking into the random-user-agent library to rotate the value passed as the "user_agent" argument of the selenium-stealth configuration. Alternatively, you could pay for a proxy provider to hide your IP address. Keep in mind, though, that proxy networks do not currently offer a ready-made Selenium configuration; see the first link below.
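As a rough illustration of cycling the user agent, the sketch below picks a random string from a small hard-coded pool and swaps it into a dictionary of selenium-stealth keyword arguments. It is a standard-library stand-in and does not use the random-user-agent library's own API. The USER_AGENTS pool and the with_random_user_agent helper are made up for this example.

```python
import random

# Illustrative pool only; in practice the strings would come from a maintained
# source such as the random-user-agent library.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/83.0.4103.53 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36",
]


def with_random_user_agent(stealth_kwargs, rng=random):
    """Return a copy of stealth_kwargs with "user_agent" replaced by a random pick."""
    updated = dict(stealth_kwargs)  # leave the caller's dictionary untouched
    updated["user_agent"] = rng.choice(USER_AGENTS)
    return updated
```

Each new browser session could then call stealth(driver, **with_random_user_agent(base_kwargs)), where base_kwargs holds the remaining settings from the configuration above.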
Information on Proxy Network Selenium Configuration: Python Selenium Proxy Network
Information on Selenium Detectability in the Cloud: Python Selenium AWS Lambda Change WebGL Vendor/Renderer For Undetectable Headless Scraper