Bot on Heroku: unable to scrape sites because of captcha, even though everything works on my PC

Time:08-22

I have a simple bot on Heroku that works with Discord and scrapes sites. Normally I use the requests module to scrape sites: I get the page source and that's all. (Note: the bot doesn't spam the sites with requests; it runs only once per day or week. The site I'm scraping is Epic Games, but it's not the only one with a captcha.)


But later I discovered that the page source I get back contains a captcha protection page, so I decided to use chromedriver. After setting up chromedriver on Heroku, I still got the captcha protection. On my PC it worked completely fine, even without any of the options below; it never asked for captcha verification.
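
To confirm what the Heroku dyno actually receives, one rough check is to scan the returned page source for typical captcha wording. This is only a minimal sketch: the helper name `looks_like_captcha` and the marker strings are assumptions made for illustration, not an exhaustive or official way to detect a challenge page.

```python
# Hypothetical marker list: phrases that often appear on captcha or
# challenge pages. Adjust it to whatever the blocked response contains.
CAPTCHA_MARKERS = ("captcha", "are you a robot", "unusual traffic")


def looks_like_captcha(html: str) -> bool:
    """Return True if the page source contains any of the marker strings."""
    lowered = html.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)
```

Running the same check on the output of `requests.get(url).text` or `driver.page_source`, both locally and on Heroku, would show whether the difference really comes from where the request originates.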

So this is what I tried (note: I use undetected-chromedriver, an optimized version of Selenium's chromedriver):


1. The page source asked for JavaScript to be enabled, so I added a chromedriver option:

import undetected_chromedriver as uc  # the code below refers to the module as `uc`

opts = uc.ChromeOptions()
opts.add_argument("--enable-javascript")
driver = uc.Chrome(use_subprocess=True, options=opts)

driver.get(url)  # url is the page to scrape
print(driver.page_source)

It still showed the captcha verification, but now without the JavaScript error.


2. After doing some research, I discovered that Heroku's IP might be on some sort of block list, so it was suggested that I add a proxy to the chromedriver options:

import undetected_chromedriver as uc  # the code below refers to the module as `uc`

opts = uc.ChromeOptions()
opts.add_argument("--enable-javascript")
# hostip and port are placeholders for the proxy address
opts.add_argument("--proxy-server=socks5://hostip:port")
driver = uc.Chrome(use_subprocess=True, options=opts)

driver.get(url)  # url is the page to scrape
print(driver.page_source)

3. I found an option similar to the second one, which seemed to work for others, but the site still showed the captcha verification:

import undetected_chromedriver as uc  # the code below refers to the module as `uc`
import os
import shutil
import tempfile

class ProxyExtension:
    manifest_json = """
    {
        "version": "1.0.0",
        "manifest_version": 2,
        "name": "Chrome Proxy",
        "permissions": [
            "proxy",
            "tabs",
            "unlimitedStorage",
            "storage",
            "<all_urls>",
            "webRequest",
            "webRequestBlocking"
        ],
        "background": {"scripts": ["background.js"]},
        "minimum_chrome_version": "76.0.0"
    }
    """

    background_js = """
    var config = {
        mode: "fixed_servers",
        rules: {
            singleProxy: {
                scheme: "http",
                host: "%s",
                port: %d
            },
            bypassList: ["localhost"]
        }
    };

    chrome.proxy.settings.set({value: config, scope: "regular"}, function() {});

    function callbackFn(details) {
        return {
            authCredentials: {
                username: "%s",
                password: "%s"
            }
        };
    }

    chrome.webRequest.onAuthRequired.addListener(
        callbackFn,
        { urls: ["<all_urls>"] },
        ['blocking']
    );
    """

    def __init__(self, host, port, user, password):
        self._dir = os.path.normpath(tempfile.mkdtemp())

        manifest_file = os.path.join(self._dir, "manifest.json")
        with open(manifest_file, mode="w") as f:
            f.write(self.manifest_json)

        background_js = self.background_js % (host, port, user, password)
        background_file = os.path.join(self._dir, "background.js")
        with open(background_file, mode="w") as f:
            f.write(background_js)

    @property
    def directory(self):
        return self._dir

    def __del__(self):
        shutil.rmtree(self._dir)


if __name__ == "__main__":
    # Placeholders: replace with real values; port must be an int (formatted with %d above)
    proxy = ("hostip", port, "username", "pass")
    proxy_extension = ProxyExtension(*proxy)

    options = uc.ChromeOptions()
    options.add_argument("--enable-javascript")
    options.add_argument(f"--load-extension={proxy_extension.directory}")
    driver = uc.Chrome(use_subprocess=True, options=options)

I've also tried other options, such as adding --headless, changing the user agent to Firefox, adding a no-GPU option, etc.

I've been trying to fix this for a month now. I hope someone knows the answer to my problem.

CodePudding user response:

You are likely getting the captcha because Heroku uses datacenter IPs, which are probably flagged. You have a couple of options: you could use a residential proxy and hope it isn't flagged, so that you don't get a captcha, or you could pay for a captcha-solving service such as 2Captcha or Capmonster. I'm not sure exactly which type of captcha you are getting, but both support reCAPTCHA. The 2Captcha docs have a lot of good information on submitting the captcha once it is solved.
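
The "try a proxy and hope it isn't flagged" approach can be sketched as a loop that tries a list of proxies in order and keeps the first response that is not a captcha page. This is a rough sketch under assumptions: the function names, the `is_captcha` callback, and the proxy URL format are hypothetical and chosen for illustration. A residential proxy can still be flagged, in which case the function simply returns `None`.

```python
from typing import Callable, Iterable, Optional


def fetch_without_captcha(
    url: str,
    proxies: Iterable[Optional[str]],
    fetch: Callable[[str, Optional[str]], str],
    is_captcha: Callable[[str], bool],
) -> Optional[str]:
    """Try each proxy in turn and return the first non-captcha page source.

    A proxy of None means a direct connection. Returns None if every
    attempt either raises or comes back as a captcha page.
    """
    for proxy in proxies:
        try:
            html = fetch(url, proxy)
        except Exception:
            continue  # unreachable or rejected proxy: move on to the next one
        if not is_captcha(html):
            return html
    return None


def fetch_with_requests(url: str, proxy: Optional[str]) -> str:
    """Fetch a page with requests, optionally through a proxy URL
    such as 'http://user:pass@host:port' (placeholder values)."""
    import requests  # imported lazily so the loop above works without it

    proxy_map = {"http": proxy, "https": proxy} if proxy else None
    response = requests.get(url, proxies=proxy_map, timeout=30)
    response.raise_for_status()  # a 4xx/5xx block page counts as a failure
    return response.text
```

A call might look like `fetch_without_captcha(url, [None, "http://user:pass@host:port"], fetch_with_requests, lambda html: "captcha" in html.lower())`. Passing the fetcher in as a parameter means the same loop can be reused with a Selenium-based fetcher instead of requests.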
