While trying to scrape https://quizlet.com/751002352/d2k2-die-kleidung-3-gegenteile-1-flash-cards/ for its header (among other things), I received <h1>One more step…</h1> instead of the expected <h1>D2K2 die Kleidung 3 & Gegenteile 1</h1>. I then tried find_all() in my code and got this: [<h1>One more step…</h1>, <h1 style="color:#bd2426;">Please turn JavaScript on and reload the page.</h1>].
from bs4 import BeautifulSoup
import requests
html_text = requests.get(input('Link: ')).text
soup = BeautifulSoup(html_text, 'lxml')
flashcard_title = soup.find_all('h1')
print(flashcard_title)
Printing soup shows that the site is protected by Cloudflare and wants me to complete a CAPTCHA (I think). Does anyone know how to get around this? Thanks in advance.
CodePudding user response:
The page is loaded dynamically using JavaScript, so not all elements are available to requests, which does not execute JavaScript. For the title, though, you need to add a User-Agent header; otherwise the site thinks that you're a bot and will block you.
import requests
from html import unescape
from bs4 import BeautifulSoup
URL = "https://quizlet.com/751002352/d2k2-die-kleidung-3-gegenteile-1-flash-cards/"
HEADERS = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
}
soup = BeautifulSoup(requests.get(URL, headers=HEADERS).text, "html.parser")
print(unescape(soup.find("h1").text))
Prints:
D2K2 die Kleidung 3 & Gegenteile 1
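If the header ever stops being enough, it may help to notice the challenge page explicitly rather than printing its heading as if it were the title. Below is a minimal sketch of such a check, using only the standard library so it runs without extra installs. The names H1Collector, extract_h1_texts, is_challenge_page and CHALLENGE_MARKERS are made up for illustration, and treating the two headings quoted in the question as a reliable signal of a block is an assumption, not documented behaviour of the site.

```python
from html.parser import HTMLParser

# Marker strings copied from the challenge page quoted in the question.
# Assumption: their presence in an <h1> means the request was blocked.
CHALLENGE_MARKERS = ("One more step", "Please turn JavaScript on")


class H1Collector(HTMLParser):
    """Collects the text content of every <h1> element in a document."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.headings = []
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        # Append text only while inside an <h1> element.
        if self._in_h1:
            self.headings[-1] += data


def extract_h1_texts(html_text):
    """Return the stripped text of each <h1> in html_text."""
    parser = H1Collector()
    parser.feed(html_text)
    parser.close()
    return [heading.strip() for heading in parser.headings]


def is_challenge_page(html_text):
    """Return True if any <h1> contains one of the challenge markers."""
    return any(
        marker in heading
        for heading in extract_h1_texts(html_text)
        for marker in CHALLENGE_MARKERS
    )
```

With requests, the response text could be passed to is_challenge_page() before parsing, so the script can stop or retry with different headers instead of reporting the wrong title.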