I'm trying to scrape leboncoin using Python and Selenium.
I had just got started when I noticed they use DataDome for bot detection, so I have to pass a captcha. Before trying to automate any of that (this question is not about that), I solved the captcha by hand in the Chromium browser that Selenium opens, and it didn't work: whenever I solve it, I am sent straight back to the captcha. I can't access the site; it's stuck in a loop.
Here's my code:
import time
from selenium import webdriver
options = webdriver.ChromeOptions()
# options.add_argument("--headless")
options.add_argument("--log-level=3")  # only log fatal errors
# Selenium 3 style; Selenium 4 replaces executable_path with a Service object
driver = webdriver.Chrome(executable_path='chromedriver', options=options)
url = "https://www.leboncoin.fr/voitures/2182521551.htm"
driver.get("https://www.leboncoin.fr")  # load the home page first
driver.get(url)  # then the listing page
time.sleep(100)  # keep the browser open long enough to solve the captcha by hand
CodePudding user response:
Your code is fine.
The problem is that this kind of firewall is usually well protected against automated browsers such as Playwright, Selenium, etc. (After all, that is their job: preventing bots from accessing the site.)
One option is to tweak your Selenium browser's configuration so that it mimics a regular Chrome installation and tricks DataDome into thinking you're a real user.
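As a rough sketch of what such tweaking might look like: the helper names `stealth_chrome_arguments` and `build_driver` below are made up for illustration, and the flags shown are commonly used ones, not a recipe known to get past DataDome. It assumes Selenium 4 and a local Chrome installation.

```python
def stealth_chrome_arguments(user_agent=None):
    """Return Chrome command-line flags that hide some automation hints.

    Hypothetical helper; there is no guarantee these flags defeat any
    particular bot-detection product.
    """
    args = [
        # Disables the Blink feature that makes navigator.webdriver report true.
        "--disable-blink-features=AutomationControlled",
    ]
    if user_agent:
        # Optionally present a specific user-agent string.
        args.append(f"--user-agent={user_agent}")
    return args


def build_driver(user_agent=None):
    """Create a Chrome driver using the flags above (assumes Selenium 4)."""
    # Imported lazily so the flag helper works without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in stealth_chrome_arguments(user_agent):
        options.add_argument(arg)
    # Remove the automation switch and extension that Chrome adds by default
    # when it is launched by a driver.
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option("useAutomationExtension", False)
    return webdriver.Chrome(options=options)
```

Calling `build_driver()` in place of the plain `webdriver.Chrome(...)` call would apply these settings; whether that is enough depends on what else the firewall checks.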
Another option is to look at what the payload sent to the firewall (in this case to ~/datadome.js) consists of and try to replicate it, by reverse engineering the JavaScript that constructs and sends the payload.
Keep in mind that they can also fingerprint you by looking at other things, such as your TLS configuration (e.g. cipher suites) or simply your IP address. Generally, if a company uses such a firewall, it means they do not want you to scrape their site, so avoid doing so if that is the case.
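To illustrate the TLS point above: every client enables its own list of cipher suites, and that list is visible to the server during the handshake. The sketch below prints the list enabled by Python's standard `ssl` module, which applies to scripts that talk to the site directly rather than through a browser (a Selenium-driven Chrome uses Chrome's own TLS stack). The function name `offered_cipher_suites` is invented for this example.

```python
import ssl


def offered_cipher_suites():
    """Return the names of the cipher suites Python's default TLS context enables."""
    context = ssl.create_default_context()
    # get_ciphers() returns one dict per enabled cipher; keep only the names.
    return [cipher["name"] for cipher in context.get_ciphers()]


if __name__ == "__main__":
    for name in offered_cipher_suites():
        print(name)
```

The exact list varies with the Python and OpenSSL versions installed, which is part of what makes it usable as a fingerprint.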