I am trying to find an efficient way to extract data displayed on this page:
https://www.kartanarusheniy.org/messages
The page is built from around 44k JSON files, each fetched from https://www.kartanarusheniy.org/api/messages/ by its ID number (https://www.kartanarusheniy.org/api/messages/1, https://www.kartanarusheniy.org/api/messages/3, etc.). The task is to download all of those 44k files. However, the server sits behind Cloudflare, which prevents me from simply downloading them.
I have made numerous attempts to make it work using Selenium running on Google Colab. Here's the code I've ended up with:
!apt update
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
!pip install selenium
from selenium import webdriver
import pandas as pd
options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument('--no-sandbox')
options.add_argument("--enable-javascript")
options.add_argument('--disable-dev-shm-usage')
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options)
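i = 1
while i < 10:
    # Build the API URL and the output file name for this message ID
    url = "https://www.kartanarusheniy.org/api/messages/" + str(i)
    filename = str(i) + ".json"
    #session_obj = requests.Session ()
    driver.get(url)  # driver.get() returns None, so there is nothing to assign
    saver = driver.page_source
    print(saver)
    i += 1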
Instead of JSON, this code returns HTML pages with the usual Cloudflare challenge messages: "Checking if the site connection is secure", "Enable JavaScript and cookies to continue", and "www.kartanarusheniy.org needs to review the security of your connection before proceeding".
I have also tried undetected_chromedriver and selenium_stealth (as in "Selenium headless: How to bypass Cloudflare detection using Selenium"), without success.
What would be my other options in this case?
CodePudding user response:
You might be able to use the undetected-chromedriver mode of SeleniumBase, which has more features than the original undetected-chromedriver. Below is a simple example where it bypasses the Selenium detection and gets to the main site you want, and takes a screenshot, with minimal lines of code.
First, pip install -U seleniumbase, then run the following with python:
from seleniumbase import Driver
from seleniumbase import page_actions
driver = Driver(uc=True)
driver.get("https://www.kartanarusheniy.org/")
page_actions.wait_for_element(driver, "div.main")
screenshot_name = "kartanarusheniy.png"
driver.save_screenshot(screenshot_name)
print("\nScreenshot saved to: %s" % screenshot_name)
driver.quit()
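If that gets you past the check, the same driver could in principle be pointed at the API URLs from the question. What follows is only a rough sketch under a few assumptions, not something I have verified against this site: it assumes Chrome shows a raw JSON response wrapped in a <pre> element, that the challenge page contains the phrases quoted in the question, and that missing IDs simply return something that is not JSON. The names is_challenge_page, extract_json and download_messages are hypothetical helpers made up for this example.

```python
import html
import json
import re
from pathlib import Path

# Endpoint pattern taken from the question.
API_URL = "https://www.kartanarusheniy.org/api/messages/{}"

# Phrases the question reports seeing on the Cloudflare challenge page.
CHALLENGE_MARKERS = (
    "Checking if the site connection is secure",
    "Enable JavaScript and cookies to continue",
)


def is_challenge_page(page_source):
    """Return True if the page looks like the Cloudflare interstitial."""
    return any(marker in page_source for marker in CHALLENGE_MARKERS)


def extract_json(page_source):
    """Parse JSON out of the page source; return None if none is found.

    Assumption: Chrome renders a raw JSON response inside a <pre> element.
    """
    match = re.search(r"<pre[^>]*>(.*?)</pre>", page_source, re.DOTALL)
    text = html.unescape(match.group(1)) if match else page_source
    try:
        return json.loads(text)
    except ValueError:
        return None


def download_messages(first_id, last_id, out_dir="messages"):
    """Fetch each message by ID and save it as <id>.json (hypothetical helper)."""
    # Imported here so the helpers above can be used without seleniumbase.
    from seleniumbase import Driver

    Path(out_dir).mkdir(exist_ok=True)
    driver = Driver(uc=True)
    try:
        for message_id in range(first_id, last_id + 1):
            driver.get(API_URL.format(message_id))
            source = driver.page_source
            if is_challenge_page(source):
                print("Challenge page returned for ID %s, skipping" % message_id)
                continue
            data = extract_json(source)
            if data is None:
                print("No JSON found for ID %s, skipping" % message_id)
                continue
            out_path = Path(out_dir) / ("%s.json" % message_id)
            out_path.write_text(
                json.dumps(data, ensure_ascii=False), encoding="utf-8"
            )
    finally:
        driver.quit()
```

Calling download_messages(1, 9) would mirror the range of the loop in the question. Reusing one driver for the whole run keeps whatever cookies the browser obtained after the check, which may help on later requests, but with 44k files you would probably want to add delays and retry the skipped IDs.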