I'm trying to open the following UK Parliament page from my Colab environment, but every request fails with a 403 error. The site's header checks seem to be too strict. Following several answers to similar earlier questions, I've tried much longer sets of headers, but it still does not work.
Is there any way to do this?
from urllib.request import urlopen, Request
url = "https://members.parliament.uk/members/commons"
headers = {'User-Agent': 'Mozilla/5.0'}
request = Request(url=url, headers=headers)
response = urlopen(request)
data = response.read()
The longer header is this:
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.4 Safari/605.1.15',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'none',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive'
}
CodePudding user response:
The website is behind Cloudflare protection, as Andrew Ryan has already pointed out along with a possible solution. I also tried cloudscraper, but it didn't work and I was still getting 403. I then used Playwright with bs4,
and now it works like a charm.
Example:
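from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

data = []
with sync_playwright() as p:
    # headless=False opens a visible browser window; slow_mo delays each action by 50 ms
    browser = p.chromium.launch(headless=False, slow_mo=50)
    page = browser.new_page()
    page.goto('https://members.parliament.uk/members/commons')
    # Give the page a few seconds to finish loading before reading it
    page.wait_for_timeout(5000)
    # Container selector as posted; the attribute inside [] is missing
    loc = page.locator('div[]')
    html = loc.inner_html()
    #print(html)
    soup = BeautifulSoup(html, "lxml")
    #print(soup.prettify())
    # Each member is rendered as a card; the name sits in .primary-info
    for card in soup.select('.card.card-member'):
        d = {
            'Name': card.select_one('.primary-info').get_text(strip=True)
        }
        data.append(d)
print(data)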
Output:
[{'Name': 'Ms Diane Abbott'}, {'Name': 'Debbie Abrahams'}, {'Name': 'Nigel Adams'}, {'Name': 'Bim Afolami'}, {'Name': 'Adam Afriyie'}, {'Name': 'Nickie Aiken'}, {'Name': 'Peter Aldous'}, {'Name': 'Rushanara Ali'}, {'Name': 'Tahir Ali'}, {'Name': 'Lucy Allan'}, {'Name': 'Dr Rosena Allin-Khan'}, {'Name': 'Mike Amesbury'}, {'Name': 'Fleur Anderson'}, {'Name': 'Lee Anderson'}, {'Name': 'Stuart Anderson'}, {'Name': 'Stuart Andrew'}, {'Name': 'Caroline Ansell'}, {'Name': 'Tonia Antoniazzi'}, {'Name': 'Edward Argar'}, {'Name': 'Jonathan Ashworth'}]
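To check whether a 403 like the one in the question really comes from Cloudflare rather than from the site's own header checks, it can help to look at the headers of the rejected response. The following is only a rough sketch using the standard library. The helper names probe and looks_like_cloudflare are made up for illustration, and the check assumes that Cloudflare-served responses typically carry a "Server: cloudflare" header or a "CF-RAY" header, so treat the result as a hint, not proof.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def looks_like_cloudflare(headers):
    """Heuristic (assumption): True if the headers carry Cloudflare markers."""
    # Header names are case-insensitive, so normalise before comparing.
    lowered = {str(k).lower(): str(v).lower() for k, v in headers.items()}
    return "cloudflare" in lowered.get("server", "") or "cf-ray" in lowered


def probe(url, user_agent="Mozilla/5.0"):
    """Return (status, headers) for url, also when the server answers with an error."""
    request = Request(url, headers={"User-Agent": user_agent})
    try:
        with urlopen(request, timeout=30) as response:
            return response.status, dict(response.headers.items())
    except HTTPError as err:
        # HTTPError still exposes the status code and headers of the rejected response.
        return err.code, dict(err.headers.items())
```

Calling probe('https://members.parliament.uk/members/commons') and passing the returned headers to looks_like_cloudflare shows whether the block is likely coming from Cloudflare. If it is, changing request headers alone is unlikely to help, which would be consistent with the browser-based approach above being the one that worked.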