BeautifulSoup Returning "None" When Element Exists?

I am new to web scraping in Python, and some of my programs work while others seem to fail at random. I am trying to extract a certain string of text from a page (located here, and in the HTML at:

<body >
    <div id="wrapper">
        <div >
            <div >
                <div >
                    <div >
                        <div >
                            <div >
                                <div >
                                    <h4 >Status</h4>
                                        <b>Offline</b>

), following this tutorial (not using the tutorial webpage for scraping), but when I call print(results.prettify()), it raises AttributeError: 'NoneType' object has no attribute 'prettify'. So far, my complete code is:

from requests import *
from bs4 import BeautifulSoup
URL = 'https://plancke.io/hypixel/player/stats/Captbugz'
page = get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id='wrapper')
print(results.prettify())

Following the tutorial, print(results.prettify()) should have printed HTML (and did while I was following the example), but it raised AttributeError: 'NoneType' object has no attribute 'prettify' instead. I have looked for a solution and found two suggestions: 1) use Selenium, and 2) the data isn't being scraped in the first place. For 1), I don't believe the content I'm looking for depends on JavaScript (please correct me on this), and for 2), get(URL) is actually returning something. I am fairly new to Stack Overflow, so please let me know if I am not following the rules for posting. Thanks!

CodePudding user response：

In this particular case (the URL you mentioned), the response your script receives contains no element with id equal to "wrapper". Thus, soup.find(id='wrapper') cannot find anything and returns None, which of course has no prettify method. You should check that results is not None (i.e. that an element with that id exists) before calling .prettify().
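A minimal sketch of that check, using an inline HTML string (invented here for illustration) instead of the live page so that it runs without network access:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a block page: it has no element with id="wrapper".
html = "<html><body><div id='challenge'>Checking your browser</div></body></html>"

soup = BeautifulSoup(html, "html.parser")
results = soup.find(id="wrapper")

if results is None:
    # find() returns None when nothing matches, so guard before calling methods
    message = "No element with id='wrapper' in the response"
else:
    message = results.prettify()

print(message)
```

With the real page, the same guard turns the AttributeError into a readable message that tells you the element was never in the downloaded HTML.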

Without browser-like headers, the request ends up at a Cloudflare page instead of the real one, so you should send a User-Agent header:

from requests import *
from bs4 import BeautifulSoup
URL = 'https://plancke.io/hypixel/player/stats/Captbugz'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"}

page = get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')

results = soup.find(id='wrapper')
print(results)

CodePudding user response：

As a general rule, avoid using from xxx import *, as you can get issues with names, and it makes your code harder to read.
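To illustrate the naming issue, here is a small sketch using two standard-library modules (chosen only for illustration; they are not part of the question) that both export a function named sqrt:

```python
from math import *    # brings in math.sqrt, among many other names
from cmath import *   # silently rebinds sqrt to cmath.sqrt

# The later star import wins, so sqrt now returns a complex number.
result = sqrt(4)
print(type(result).__name__)

# Importing the modules themselves keeps both functions distinct.
import math
import cmath
print(math.sqrt(4), cmath.sqrt(4))
```

The same thing can happen with from requests import *, where short names such as get are easy to overwrite by accident; requests.get makes the origin of the function obvious.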

Your URL starts with https, and requests.get will give you a 403 error:

import requests 
from bs4 import BeautifulSoup
URL = 'https://plancke.io/hypixel/player/stats/Captbugz'
page = requests.get(URL)
print(page)

--> <Response [403]>

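HTTP 403 (Forbidden) means the server received the request and refused to serve it. It is not a certificate problem: a failed certificate check makes requests raise an SSLError rather than return a response. See the documentation: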

The 403 (Forbidden) status code indicates that the server understood
the request but refuses to authorize it. A server that wishes to
make public why the request has been forbidden can describe that
reason in the response payload (if any).

You can look at this question on the 403 response: getting 403 error while using requests.get() python
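One way to surface a 403 early is to check the status before parsing. The sketch below is an assumption-laden illustration: soup_from_response is a hypothetical helper name, and the Response object is built by hand so the example runs offline rather than contacting the site.

```python
import requests
from bs4 import BeautifulSoup


def soup_from_response(response):
    """Parse the body only if the server reported success (hypothetical helper)."""
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")


# Offline illustration: a hand-built Response standing in for the 403 above.
fake = requests.Response()
fake.status_code = 403

try:
    soup_from_response(fake)
    outcome = "parsed"
except requests.HTTPError as exc:
    outcome = f"request failed: {exc}"

print(outcome)
```

In a real script you would pass the object returned by requests.get(URL) instead of the hand-built one, so a blocked request stops with a clear HTTP error instead of a later AttributeError on None.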

edit: In any case, the 403 does not seem to be the direct cause of the exception. The AttributeError comes from the wrapper lookup: the page you get back (the error page served with the 403) has no tag with id='wrapper', so soup.find returns None. If you omit that lookup and call prettify() on soup itself, it works:

import requests
from bs4 import BeautifulSoup
URL = 'https://plancke.io/hypixel/player/stats/Captbugz'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
# results = soup.find(id='wrapper')  # remove this line
print(soup.prettify())  # prints whatever page came back, here the 403 error page