HTTP Error 406 on Python Web scraper copy from Python All in One for Dummies


Afternoon all,

I'm following Python All In One for Dummies and have come to the chapter on web scraping. I'm trying to interact with the website they designed specifically for this chapter, but I keep getting an "HTTP Error 406" on all my requests. The initial "open a page and get a response" example had the same issue until I pointed it at Google, so I decided the web page itself was at fault. Here's my code:

# get request module from URL lib
from urllib import request
# Get Beautiful Soup to help with the scraped data
from bs4 import BeautifulSoup

# sample page for practice
page_url = 'https://alansimpson.me/python/scrape_sample.html'

# open that page:
rawpage = request.urlopen(page_url)

#make a BS object from the html page
soup = BeautifulSoup(rawpage, 'html5lib')

# isolate the content block
content = soup.article

# create an empty list for dictionary items
links_list = []

#loop through all the links in the article
for link in content.find_all('a'):
    try:
        url = link.get('href')
        img = link.img.get('src')
        text = link.span.text

        links_list[{'url':url, 'img':img, 'text':text}]
    except AttributeError:
        pass

print(links_list)

and this is the output in the console:

(base) youngdad33@penguin:~/Python/AIO Python$ /usr/bin/python3 "/home/youngdad33/Python/AIO Python/webscrapper.py"
Traceback (most recent call last):
  File "/home/youngdad33/Python/AIO Python/webscrapper.py", line 10, in <module>
    rawpage = request.urlopen(page_url)
  File "/usr/lib/python3.7/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.7/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/usr/lib/python3.7/urllib/request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.7/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.7/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 406: Not Acceptable

I gather the most important line is the last one, "HTTP Error 406: Not Acceptable", which with a bit of digging I understand to mean the server is rejecting my request because of its headers.

So how do I get this working? I'm using VS Code on a Chromebook, running Debian Linux with Anaconda 3.

Thank you!

CodePudding user response:

You need to send a browser-style User-Agent header with your request, as follows:

# use the requests library instead of urllib
import requests
# Get Beautiful Soup to help with the scraped data
from bs4 import BeautifulSoup

# sample page for practice
page_url = 'https://alansimpson.me/python/scrape_sample.html'
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'}
# open that page:
rawpage = requests.get(page_url,headers=headers)

#make a BS object from the html page
soup = BeautifulSoup(rawpage.content, 'html5lib')

# isolate the content block
content = soup.article

# create an empty list for dictionary items
links_list = []

#loop through all the links in the article
for link in content.find_all('a'):
    try:
        url = link.get('href')
        img = link.img.get('src')
        text = link.span.text

        links_list.append([{'url':url, 'img':img, 'text':text}])
    except AttributeError:
        pass

print(links_list)

Output

[[{'url': 'http://www.sixthresearcher.com/python-3-reference-cheat-sheet-for-beginners/', 'img': '../datascience/python/basics/basics256.jpg', 'text': 'Basics'}], [{'url': 
'https://alansimpson.me/datascience/python/beginner/', 'img': '../datascience/python/beginner/beginner256.jpg', 'text': 'Beginner'}], [{'url': 'https://alansimpson.me/datascience/python/justbasics/', 'img': '../datascience/python/justbasics/justbasics256.jpg', 'text': 'Just the Basics'}], [{'url': 'https://alansimpson.me/datascience/python/cheatography/', 'img': '../datascience/python/cheatography/cheatography256.jpg', 'text': 'Cheatography'}], [{'url': 'https://alansimpson.me/datascience/python/dataquest/', 'img': '../datascience/python/dataquest/dataquest256.jpg', 'text': 'Dataquest'}], [{'url': 'https://alansimpson.me/datascience/python/essentials/', 'img': '../datascience/python/essentials/essentials256.jpg', 'text': 'Essentials'}], [{'url': 'https://alansimpson.me/datascience/python/memento/', 'img': '../datascience/python/memento/memento256.jpg', 'text': 'Memento'}], [{'url': 'https://alansimpson.me/datascience/python/syntax/', 'img': '../datascience/python/syntax/syntax256.jpg', 'text': 'Syntax'}], [{'url': 'https://alansimpson.me/datascience/python/classes/', 'img': '../datascience/python/classes/classes256.jpg', 'text': 'Classes'}], [{'url': 'https://alansimpson.me/datascience/python/dictionaries/', 'img': '../datascience/python/dictionaries/dictionaries256.jpg', 'text': 'Dictionaries'}], [{'url': 'https://alansimpson.me/datascience/python/functions/', 'img': '../datascience/python/functions/functions256.jpg', 'text': 'Functions'}], [{'url': 'https://alansimpson.me/datascience/python/ifwhile/', 'img': '../datascience/python/ifwhile/ifwhile256.jpg', 'text': 'If & While Loops'}], [{'url': 'https://alansimpson.me/datascience/python/lists/', 'img': '../datascience/python/lists/lists256.jpg', 'text': 'Lists'}]]
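If you would rather keep using urllib as the book does, you can get the same effect by wrapping the URL in a urllib.request.Request object and setting the header there; urlopen() accepts that object in place of a plain URL string. A minimal sketch (the User-Agent string is just a plausible browser identifier, not something the server specifically requires):

```python
# keep using urllib, as in the book, but attach a browser-style
# User-Agent so the server does not reject the request with a 406
from urllib import request

page_url = 'https://alansimpson.me/python/scrape_sample.html'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

# wrap the URL in a Request object carrying the extra header
req = request.Request(page_url, headers=headers)

# urllib stores header names capitalized internally
print(req.get_header('User-agent'))

# the rest of the book's code is unchanged, e.g.:
# rawpage = request.urlopen(req)
# soup = BeautifulSoup(rawpage, 'html5lib')
```

This keeps the rest of the chapter's code intact, since the only change is what you pass to urlopen().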