I have a simple bot on Heroku that works with Discord and scrapes sites. Normally I use the requests module to scrape: I fetch the page source and that's all. (Note: the bot doesn't spam the sites; it requests them only once per day or week. The site I'm requesting is Epic Games, but it's not the only one with a captcha.)
Later I discovered that the page source I got back contained a captcha challenge, so I decided to use chromedriver. After setting up chromedriver on Heroku, I still got the captcha on those sites. On my PC it worked completely fine, even without any of the options below; it never asked for captcha verification.
So this is what I tried (note: I use undetected_chromedriver, an optimized version of the Selenium chromedriver):
1. The page source asked for JavaScript to be enabled, so I added a chromedriver option:
import undetected_chromedriver as uc  # alias must match the uc.* calls below
opts = uc.ChromeOptions()
opts.add_argument("--enable-javascript")
driver = uc.Chrome(use_subprocess=True, options=opts)
driver.get(url)
print(driver.page_source)
It still showed the captcha verification, but now without the JavaScript error.
2. After doing some research, I found that Heroku's IP might be on some sort of block list, so it was suggested that I add a proxy to the chromedriver options:
import undetected_chromedriver as uc  # alias must match the uc.* calls below
opts = uc.ChromeOptions()
opts.add_argument("--enable-javascript")
# hostip and port are placeholders for the proxy's address
opts.add_argument("--proxy-server=socks5://hostip:port")
driver = uc.Chrome(use_subprocess=True, options=opts)
driver.get(url)
print(driver.page_source)
3. I found an option similar to the second one that seemed to work for others, but the site still showed the captcha verification:
import undetected_chromedriver as uc  # alias must match the uc.* calls below
import os
import shutil
import tempfile


class ProxyExtension:
    """Writes a temporary Chrome extension that sets an authenticated HTTP proxy."""

    manifest_json = """
    {
        "version": "1.0.0",
        "manifest_version": 2,
        "name": "Chrome Proxy",
        "permissions": [
            "proxy",
            "tabs",
            "unlimitedStorage",
            "storage",
            "<all_urls>",
            "webRequest",
            "webRequestBlocking"
        ],
        "background": {"scripts": ["background.js"]},
        "minimum_chrome_version": "76.0.0"
    }
    """

    # The %s / %d slots are filled with host, port, user, password in __init__.
    background_js = """
    var config = {
        mode: "fixed_servers",
        rules: {
            singleProxy: {
                scheme: "http",
                host: "%s",
                port: %d
            },
            bypassList: ["localhost"]
        }
    };
    chrome.proxy.settings.set({value: config, scope: "regular"}, function() {});
    function callbackFn(details) {
        return {
            authCredentials: {
                username: "%s",
                password: "%s"
            }
        };
    }
    chrome.webRequest.onAuthRequired.addListener(
        callbackFn,
        { urls: ["<all_urls>"] },
        ['blocking']
    );
    """

    def __init__(self, host, port, user, password):
        # Create a temp directory holding manifest.json and background.js.
        self._dir = os.path.normpath(tempfile.mkdtemp())
        manifest_file = os.path.join(self._dir, "manifest.json")
        with open(manifest_file, mode="w") as f:
            f.write(self.manifest_json)
        background_js = self.background_js % (host, port, user, password)
        background_file = os.path.join(self._dir, "background.js")
        with open(background_file, mode="w") as f:
            f.write(background_js)

    @property
    def directory(self):
        return self._dir

    def __del__(self):
        # Remove the temp directory when the object is garbage collected.
        shutil.rmtree(self._dir)


if __name__ == "__main__":
    # Placeholders: replace with the proxy's host, port (an int), username and password.
    proxy = ("hostip", port, "username", "pass")
    proxy_extension = ProxyExtension(*proxy)
    options = uc.ChromeOptions()
    options.add_argument("--enable-javascript")
    options.add_argument(f"--load-extension={proxy_extension.directory}")
    driver = uc.Chrome(use_subprocess=True, options=options)
I've also tried other options, such as adding --headless, changing the user agent to Firefox, adding a no-GPU option, and so on.
I've been trying to fix this for a month, and I hope someone knows the answer to my problem.
CodePudding user response:
You are likely receiving the captcha because Heroku uses datacenter IP addresses, which are probably flagged by the site. You have a couple of options. You could try a residential proxy and hope that its IP is not flagged, so that you don't get a captcha. Or you could pay for a captcha-solving service such as 2Captcha or Capmonster. I'm not sure exactly what type of captcha you are getting, but both services support reCAPTCHA. The 2Captcha docs have a lot of good information on submitting the captcha once it has been solved.
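To tell whether a given proxy actually gets past the captcha, it can help to check the returned page source automatically instead of reading it by eye. The following is only a rough sketch under stated assumptions: the marker strings in `CAPTCHA_MARKERS` are my guess at what a challenge page contains (a normal page that merely mentions a captcha would be a false positive), and `looks_like_captcha` and `first_clean_page` are hypothetical helper names, not part of requests, Selenium or undetected_chromedriver.

```python
from typing import Callable, Iterable, Optional

# Assumption: substrings that tend to appear in captcha challenge pages.
# Adjust them to whatever the blocked page source really contains.
CAPTCHA_MARKERS = ("g-recaptcha", "h-captcha", "captcha")


def looks_like_captcha(page_source: str) -> bool:
    """Heuristic check: True if the page source contains a captcha marker."""
    lowered = page_source.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)


def first_clean_page(
    fetch: Callable[[Optional[str]], str],
    proxies: Iterable[Optional[str]],
) -> Optional[str]:
    """Call fetch(proxy) for each proxy in order and return the first page
    that does not look like a captcha challenge, or None if all of them do.

    fetch is supplied by the caller, so it can wrap requests or a browser
    driver. A proxy value of None means a direct connection.
    """
    for proxy in proxies:
        page = fetch(proxy)
        if not looks_like_captcha(page):
            return page
    return None
```

With requests, `fetch` could be something like `lambda proxy: requests.get(url, proxies={"http": proxy, "https": proxy} if proxy else None, timeout=30).text`, where `url` and the proxy addresses are your own values. If every proxy still returns a challenge page, that is the point at which a solving service becomes the remaining option.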