How do I avoid imperva bot detection?

Time:05-24

I am running a Python script that scrapes a website. The site uses Imperva to detect automated scripts crawling its pages, and Imperva blocks my IP as soon as I run the script. I read a suggestion to add time.sleep(random.randint(a,b)) to the script to mimic human behaviour, but it didn't work, or perhaps it just doesn't work on its own. If it's the Chrome driver itself that they detect, then I guess it would be impossible to avoid. Does anyone have practical suggestions for things I could include in my script to bypass this? Thanks in advance.

CodePudding user response:

Introduction

Several different components need to be added to a web scraper to make it hard to detect. I recommend using the code below to test your current level of detection:

driver.get("https://bot.sannysoft.com/")

More than likely, you will fail most of those tests right off the bat. Fortunately, it's easy to configure a scraper that passes all of them and is much harder to detect.

Selenium-Stealth

selenium-stealth is a Python package that is used to make Selenium harder to detect. Install it with:

pip install selenium-stealth

and then apply the configuration below:

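from selenium import webdriver
from selenium_stealth import stealth

# Assumes Chrome and a matching chromedriver are available locally.
driver = webdriver.Chrome()

# Apply the stealth patches before navigating to any page.
stealth(driver,
        user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.53 Safari/537.36',
        languages=["en-US", "en"],
        vendor="Google Inc.",
        platform="Win32",
        webgl_vendor="Intel Inc.",
        renderer="Intel Iris OpenGL Engine",
        fix_hairline=True,
        )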

Your web scraper should now pass all of the tests. Next, try this configuration against the Imperva-protected site.

More information

If you are still getting blocked, I recommend looking into the random-user-agent library to cycle the value passed as the "user_agent" argument of the selenium-stealth configuration. Otherwise, you could pay for a proxy provider to disguise your IP. Keep in mind, however, that proxy networks do not currently come with a Selenium configuration.
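As a rough sketch of the user-agent cycling idea, the snippet below picks a user agent at random and builds the keyword arguments for stealth(). It uses only the standard library rather than random-user-agent, and the names USER_AGENT_POOL and build_stealth_kwargs are hypothetical helpers made up for illustration. The strings in the pool are example values, not a vetted list, and should be replaced with ones that match the browser you actually run.

```python
import random

# Hypothetical pool of desktop Chrome user-agent strings (illustrative only).
USER_AGENT_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/83.0.4103.53 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
]


def build_stealth_kwargs(rng=random):
    """Return keyword arguments for stealth() with a randomly chosen user agent.

    rng is any object with a choice() method (the random module by default),
    which makes the selection reproducible when a seeded Random is passed in.
    """
    return {
        "user_agent": rng.choice(USER_AGENT_POOL),
        "languages": ["en-US", "en"],
        "vendor": "Google Inc.",
        "platform": "Win32",
        "webgl_vendor": "Intel Inc.",
        "renderer": "Intel Iris OpenGL Engine",
        "fix_hairline": True,
    }


# Usage with an existing Selenium driver (not executed here):
#   from selenium_stealth import stealth
#   stealth(driver, **build_stealth_kwargs())
```

Every entry in the pool keeps the same platform (Windows, Chrome) so that the user agent stays consistent with the fixed platform and vendor values. Mixing operating systems in the pool would presumably require varying those arguments as well.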

Information on Proxy Network Selenium Configuration: Python Selenium Proxy Network

Information on Selenium Detectability in the Cloud: Python Selenium AWS Lambda Change WebGL Vendor/Renderer For Undetectable Headless Scraper
