HTTP status code is not handled using scrapy and selenium

Time:11-29

I am getting the error "HTTP status code is not handled or not allowed" when using Scrapy and Selenium together. I have also set a user agent in the settings, but the error persists. How can I solve this? This is the page link: https://www.askgamblers.com/online-casinos/countries/uk

import scrapy
from scrapy.http import Request
from selenium import webdriver
import time
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager


class TestSpider(scrapy.Spider):
    name = 'test'
    


    def start_requests(self):
            options = webdriver.ChromeOptions()
            options.add_argument("--no-sandbox")
            options.add_argument("--disable-gpu")
            options.add_argument("--window-size=1920,1080")
            options.add_argument("--disable-extensions")
            # Pass the options to the driver, otherwise they are never applied
            driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
            
            URL = 'https://www.askgamblers.com/online-casinos/countries/uk'
            driver.get(URL)
            
            time.sleep(3)
            page_links =driver.find_elements(By.XPATH, "//div[@class='card__desc']//a[starts-with(@href, '/online')]")
            for link in page_links:
                    href=link.get_attribute("href")
                    yield scrapy.Request(href)
            driver.quit()


    def parse(self, response):
            # Scrapy's response.css() takes only the selector string, not a Selenium By locator
            title=response.css("h1.ch-title::text").get()
            yield{
                    'title':title
                    }

CodePudding user response:

You are getting this error because the website is behind Cloudflare protection.
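
For context, the log message "HTTP status code is not handled or not allowed" is emitted by Scrapy's HttpErrorMiddleware, which by default drops non-2xx responses before they reach the spider callback. As a debugging aid only, a settings fragment along these lines lets such responses through so that response.status can be inspected in parse. The choice of 403 and 503 is an assumption about what the site returns, and this does not get past Cloudflare:

```python
# settings.py (sketch): pass blocked responses to the callback for inspection.
# 403 and 503 are assumed status codes; adjust to what the log actually shows.
HTTPERROR_ALLOWED_CODES = [403, 503]
```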

https://www.askgamblers.com/online-casinos/countries/uk is using Cloudflare CDN/Proxy!

https://www.askgamblers.com/online-casinos/countries/uk is NOT using Cloudflare SSL
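
A rough way to check the same thing from code is to look at the status and headers of the rejected response. The sketch below is only a heuristic: the function name is hypothetical, it assumes header values are plain strings, and it relies on the `Server: cloudflare` and `CF-RAY` headers that Cloudflare normally adds, together with the assumption that a block shows up as 403 or 503.

```python
def looks_like_cloudflare_block(status, headers):
    """Heuristic: True if a response looks like a Cloudflare block or challenge.

    `headers` is assumed to be a mapping of header name to string value.
    """
    # Header names are case-insensitive, so normalize them first
    normalized = {name.lower(): value for name, value in headers.items()}
    served_by_cloudflare = (
        normalized.get("server", "").lower() == "cloudflare"
        or "cf-ray" in normalized
    )
    # 403 and 503 are assumed here to be the statuses used for blocks
    return served_by_cloudflare and status in (403, 503)


# The header values below are placeholders, not captured from the site
print(looks_like_cloudflare_block(403, {"Server": "cloudflare", "CF-RAY": "placeholder"}))
```

A normal 200 response served through Cloudflare returns False here, since only the combination of a Cloudflare header and a blocking status is treated as a match.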

Scrapy, with or without Selenium, can't handle Cloudflare protection (I tested), but Selenium driving the browser on its own can do the job. So I use Selenium alone and integrate bs4 (BeautifulSoup) with it to parse the content more robustly.

Script:

from selenium import webdriver
import time
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup

options = webdriver.ChromeOptions()
options.add_argument("--no-sandbox")
options.add_argument("--disable-gpu")
options.add_argument("--window-size=1920,1080")
options.add_argument("--disable-extensions")
# Pass the options to the driver, otherwise they are never applied
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
                    
URL = 'https://www.askgamblers.com/online-casinos/countries/uk'
driver.get(URL)
time.sleep(2)
urls= []
page_links =driver.find_elements(By.XPATH, "//div[@class='card__desc']//a[starts-with(@href, '/online')]")
for link in page_links:
    href=link.get_attribute("href")
    urls.append(href)
    #print(href)

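for url in urls:
    driver.get(url)
    time.sleep(1)
    soup = BeautifulSoup(driver.page_source,"lxml")
    # select_one returns None when the page has no matching <h1>
    heading = soup.select_one("h1.ch-title")
    if heading is not None:
        print(heading.get_text(strip=True))
    else:
        print('empty')

# Close the browser once all pages have been visited
driver.quit()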

Output:

Mr.Play Casino
Bet365 Casino
Slotnite Casino
Trada Casino
PlayFrank Casino
Karamba Casino
Hello! Casino
21 Prive Casino
Casilando Casino
AHTI Games Casino
BacanaPlay Casino
Spinland Casino
Fun Casino
Slot Planet Casino
21 Casino
Conquer Casino
CasinoCasino
Barbados Casino
King Casino
Slots Magic Casino
Spin Station Casino
HeySpin Casino
CasinoLuck
Casino RedKings