bs4 scraping doesn't work on a site because Cloudflare is enabled

Whilst trying to scrape https://quizlet.com/751002352/d2k2-die-kleidung-3-gegenteile-1-flash-cards/ for its header (among other things), instead of receiving <h1>D2K2 die Kleidung 3 & Gegenteile 1</h1>, I received <h1>One more step…</h1>. I tried find_all() in my code, and got this: [<h1>One more step…</h1>, <h1 style="color:#bd2426;">Please turn JavaScript on and reload the page.</h1>].

from bs4 import BeautifulSoup
import requests

html_text = requests.get(input('Link: ')).text
soup = BeautifulSoup(html_text, 'lxml')
flashcard_title = soup.find_all('h1')
print(flashcard_title)

Printing soup shows that the site is protected by Cloudflare and (I think) wants me to solve a CAPTCHA. Does anyone know how to get around this? Thanks in advance.

CodePudding user response:

The page you got back relies on JavaScript, which requests does not execute, so the real content is never loaded.
You need to send a User-Agent header; otherwise the site thinks you're a bot and blocks you.

import requests
from html import unescape
from bs4 import BeautifulSoup


URL = "https://quizlet.com/751002352/d2k2-die-kleidung-3-gegenteile-1-flash-cards/"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
}

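# Send a browser-like User-Agent so the request is not rejected as a bot.
response = requests.get(URL, headers=HEADERS)
soup = BeautifulSoup(response.text, "html.parser")

# unescape() decodes any HTML entities still left in the heading text.
print(unescape(soup.find("h1").text))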

Prints:

D2K2 die Kleidung 3 & Gegenteile 1
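
If you want to know whether a response is the real page or the Cloudflare interstitial before parsing further, one option is to check the headings first. Below is a minimal sketch that uses only the standard library. The names CHALLENGE_MARKERS, h1_texts and looks_like_challenge are made up for illustration, and the marker strings are simply the two headings quoted in the question, so they may not match other challenge pages.

```python
from html.parser import HTMLParser

# Assumption: these are the headings from the question's blocked response.
# Other challenge pages may use different wording.
CHALLENGE_MARKERS = ("One more step", "Please turn JavaScript on")


class H1Collector(HTMLParser):
    """Collect the text of every <h1> element in a document."""

    def __init__(self):
        # convert_charrefs defaults to True, so entities such as &amp; are decoded.
        super().__init__()
        self.headings = []
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so append to the current heading.
        if self._in_h1:
            self.headings[-1] += data


def h1_texts(html_text):
    """Return the stripped text of all <h1> elements in html_text."""
    parser = H1Collector()
    parser.feed(html_text)
    parser.close()
    return [heading.strip() for heading in parser.headings]


def looks_like_challenge(html_text):
    """Return True if any <h1> contains one of the assumed challenge markers."""
    return any(
        marker in heading
        for heading in h1_texts(html_text)
        for marker in CHALLENGE_MARKERS
    )
```

You would call looks_like_challenge(response.text) right after the request and only go on to soup.find("h1") when it returns False. Matching on heading text is fragile, so treat it as a quick sanity check, not a reliable detector.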