Selenium, Cloudflare, Colab, and JSON

I am trying to find an efficient way to extract data displayed on this page:

https://www.kartanarusheniy.org/messages

The data comes from around 44k JSON files, each served from https://www.kartanarusheniy.org/api/messages/ by its ID number (https://www.kartanarusheniy.org/api/messages/1, https://www.kartanarusheniy.org/api/messages/3, etc.). The task is to extract all of those 44k files. However, the server uses Cloudflare, which prevents me from just downloading them.

I have made numerous attempts to make it work using Selenium running on Google Colab. Here's the code I've ended up with:

!apt update
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
!pip install selenium

from selenium import webdriver
import pandas as pd

options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument('--no-sandbox')
options.add_argument("--enable-javascript")
options.add_argument('--disable-dev-shm-usage')
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options)

i = 1
while i < 10:
  # Each message is served at /api/messages/<id>
  url = "https://www.kartanarusheniy.org/api/messages/" + str(i)
  filename = str(i) + ".json"
  #session_obj = requests.Session ()
  driver.get(url)  # get() returns None; the content is read from page_source
  saver = driver.page_source
  print(saver)
  i += 1

This code gets me HTML pages with the usual Cloudflare messages: "Checking if the site connection is secure", "Enable JavaScript and cookies to continue", and "www.kartanarusheniy.org needs to review the security of your connection before proceeding".
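
One way to tell these challenge pages apart from real API responses is to look for the phrases quoted above. The following is only a minimal sketch: `CHALLENGE_MARKERS` and `looks_like_challenge` are hypothetical names, and the phrase list covers only the wording seen in this output, so other challenge pages may read differently.

```python
# Phrases taken from the challenge pages described above (an assumption:
# Cloudflare may show different wording in other situations).
CHALLENGE_MARKERS = (
    "Checking if the site connection is secure",
    "Enable JavaScript and cookies to continue",
    "needs to review the security of your connection",
)


def looks_like_challenge(page_source):
    """Return True if the HTML contains any of the known challenge phrases."""
    return any(marker in page_source for marker in CHALLENGE_MARKERS)
```

Calling `looks_like_challenge(driver.page_source)` after each `driver.get(url)` would make it possible to skip or retry an ID instead of saving a challenge page as if it were data.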

I have also tried undetected_chromedriver and selenium_stealth (as in "Selenium headless: How to bypass Cloudflare detection using Selenium").

What would be my other options in this case?

CodePudding user response:

You might be able to use the undetected-chromedriver mode of SeleniumBase, which has more features than the original undetected-chromedriver. Below is a short example that gets past the Selenium detection, reaches the main site you want, and takes a screenshot.

First, pip install -U seleniumbase, then run the following with python:

from seleniumbase import Driver
from seleniumbase import page_actions

driver = Driver(uc=True)
driver.get("https://www.kartanarusheniy.org/")
page_actions.wait_for_element(driver, "div.main")
screenshot_name = "kartanarusheniy.png"
driver.save_screenshot(screenshot_name)
print("\nScreenshot saved to: %s" % screenshot_name)
driver.quit()
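
If the driver gets through, the same session could in principle be reused to walk the message IDs from the question and save each response. The sketch below is untested against the real site and makes some assumptions: `extract_json` and `save_messages` are hypothetical helper names, the output folder `messages` is arbitrary, and Chrome is assumed to display a raw JSON response wrapped in a `<pre>` element, so the JSON has to be pulled back out of `driver.page_source`.

```python
import html
import json
import re
from pathlib import Path

# Assumption: Chrome shows a raw JSON response inside a <pre> element.
PRE_RE = re.compile(r"<pre[^>]*>(.*?)</pre>", re.DOTALL)


def extract_json(page_source):
    """Return the parsed JSON found in a rendered page, or None if there is none."""
    match = PRE_RE.search(page_source)
    # Undo HTML escaping (&amp; etc.) that the browser applied to the JSON text.
    text = html.unescape(match.group(1)) if match else page_source
    try:
        return json.loads(text)
    except ValueError:
        return None  # not JSON, e.g. a challenge page or an error page


def save_messages(driver, ids, out_dir="messages"):
    """Fetch each ID with `driver`, write <out_dir>/<id>.json, return failed IDs."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    failed = []
    for i in ids:
        driver.get(f"https://www.kartanarusheniy.org/api/messages/{i}")
        data = extract_json(driver.page_source)
        if data is None:
            failed.append(i)  # keep the ID so it can be retried later
            continue
        path = out / f"{i}.json"
        path.write_text(json.dumps(data, ensure_ascii=False), encoding="utf-8")
    return failed
```

With the driver from the example above (before `driver.quit()`), a call such as `failed = save_messages(driver, range(1, 10))` would try the first few IDs. The IDs in the question are not contiguous, so some failures are to be expected, and the returned list shows which ones to retry or skip.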