bs4 scraping doesn't work on a site because Cloudflare is enabled

Time:01-07

While trying to scrape https://quizlet.com/751002352/d2k2-die-kleidung-3-gegenteile-1-flash-cards/ for its header (among other things), I received <h1>One more step…</h1> instead of <h1>D2K2 die Kleidung 3 & Gegenteile 1</h1>. When I used find_all() in my code, I got this: [<h1>One more step…</h1>, <h1 style="color:#bd2426;">Please turn JavaScript on and reload the page.</h1>].

from bs4 import BeautifulSoup
import requests

# Download the page at the URL entered by the user.
html_text = requests.get(input('Link: ')).text
soup = BeautifulSoup(html_text, 'lxml')
# Collect every <h1> element on the page.
flashcard_title = soup.find_all('h1')
print(flashcard_title)

Printing soup shows that the site is protected by Cloudflare and wants me to complete a captcha (I think). Does anyone know how to get around this? Thanks in advance.

CodePudding user response:

  1. The page is loaded dynamically using JavaScript, and requests does not execute JavaScript, so not all elements are available in the response.

  2. You need to send a User-Agent header; otherwise the site assumes that you're a bot and blocks you.

import requests
from html import unescape
from bs4 import BeautifulSoup


URL = "https://quizlet.com/751002352/d2k2-die-kleidung-3-gegenteile-1-flash-cards/"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
}

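# Fetch the page with a browser-like User-Agent and parse the HTML.
soup = BeautifulSoup(requests.get(URL, headers=HEADERS).text, "html.parser")

# unescape() decodes any HTML entities left in the heading text.
print(unescape(soup.find("h1").text))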

Prints:

D2K2 die Kleidung 3 & Gegenteile 1
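
A User-Agent header may not get past the check every time, so it can help to test whether the response is the challenge page before using its contents. Below is a minimal sketch of such a check. The names CHALLENGE_MARKERS, H1Collector and looks_like_challenge are hypothetical, and the marker phrases are simply the two headings quoted in the question, so other challenge pages may use different wording. It uses only the standard library's html.parser instead of bs4 so that it runs without extra installs.

```python
from html.parser import HTMLParser

# Marker phrases copied from the headings quoted in the question.
# Assumption: other challenge pages may be worded differently.
CHALLENGE_MARKERS = ("one more step", "please turn javascript on")


class H1Collector(HTMLParser):
    """Collect the text of every <h1> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current heading.
        if self._in_h1:
            self.headings[-1] += data


def looks_like_challenge(html_text):
    """Return True if any <h1> contains one of the marker phrases."""
    collector = H1Collector()
    collector.feed(html_text)
    collector.close()
    return any(
        marker in heading.lower()
        for heading in collector.headings
        for marker in CHALLENGE_MARKERS
    )
```

Calling looks_like_challenge(requests.get(URL, headers=HEADERS).text) before building the soup lets the script stop or retry instead of silently printing the challenge heading.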