Why is the Python requests module not pulling the whole HTML?

The link: https://www.hyatt.com/explore-hotels/service/hotels

code:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.hyatt.com/explore-hotels/service/hotels')
soup = BeautifulSoup(r.text, 'lxml')
print(soup.prettify())

Tried also this:

import json
import requests

r = requests.get('https://www.hyatt.com/explore-hotels/service/hotels')
data = json.dumps(r.text)
print(data)

output:

<!DOCTYPE html>
<head>
</head>
<body>
  <script src="SOME_value">
  </script>
</body>
</html>

It's printing the HTML without the tag that contains the data, showing only a single script tag. How can I access the data (shown in the browser view, where it looks like JSON)?

CodePudding user response：

I don't believe this can be done. That data simply isn't in r.text.

If you do this:

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.hyatt.com/explore-hotels/service/hotels")
soup = BeautifulSoup(r.text, "html.parser")

print(soup.prettify())

You get this:

<!DOCTYPE html>
<html>
 <head>
 </head>
 <body>
  <script src="/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/ips.js?tkrm_alpekz_s1.3=0EOFte3LjRKv3iJhEEV2hrnisE5M3Lwy3ac3UPZ19zdiB49A6ZtBjtiwBqgKQN3q2MEQ3NbFjTWfmP9GqArOIAML6zTvSb4lRHD7FsmJFVWNkSwuTNWUNuJWv6hEXBG37DhBtTXFEO50999RihfPbTjsB">
  </script>
 </body>
</html>

As you can see, there is no <pre> tag, for whatever reason, so you're unable to access that data.

I also get a 429 error when accessing the URL:

GET https://www.hyatt.com/explore-hotels/service/hotels 429
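HTTP 429 means "Too Many Requests", so it is worth looking at r.status_code before trying to parse anything. As a rough sketch, the check below uses only the standard library to tell whether a body is a script-only page like the one printed above. The helper name is_script_only_page is made up for illustration, and the heuristic (a script tag, no other content tags, no visible text) is an assumption about what such a page looks like, not a general detector.

```python
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    """Collect start-tag names and visible (non-script) text."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Ignore whitespace and anything inside <script>...</script>.
        if not self._in_script and data.strip():
            self.text.append(data.strip())


def is_script_only_page(html):
    """Return True if the page has a <script> but no other content."""
    collector = _TagCollector()
    collector.feed(html)
    structural = ("html", "head", "body", "script")
    content_tags = [t for t in collector.tags if t not in structural]
    has_script = "script" in collector.tags
    return bool(has_script and not content_tags and not collector.text)
```

With a response r from requests.get(...), you could print r.status_code and is_script_only_page(r.text) to confirm you received the script-only page rather than the data.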

What is the end goal here? This site doesn't seem willing to cooperate. Some sites can't be parsed, for various reasons. If you want to play with JSON data, I would look into using an API instead.

If you google https://www.hyatt.com and manually go to the URL you mentioned, you get a 404 error.

I would say Hyatt doesn't want you parsing their site. So don't!

CodePudding user response：

The response is JSON, not HTML. You can verify this by opening the Network tab in your browser's dev tools. There you will see that the content-type header is application/json; charset=utf-8.
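The same check can be made from Python, since requests exposes the response headers as r.headers. Here is a minimal sketch, with is_json_content_type being a hypothetical helper name. Keep in mind that a script may not receive the same response the browser does, so it is worth checking what your own request actually got back.

```python
def is_json_content_type(header_value):
    """Return True if a Content-Type header value denotes JSON."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type == "application/json" or media_type.endswith("+json")
```

Usage would be something like is_json_content_type(r.headers.get("Content-Type", "")), where r is the response from requests.get(...).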

You can parse this into a useable form with the standard json package:

import json
import requests

r = requests.get('https://www.hyatt.com/explore-hotels/service/hotels')
data = json.loads(r.text)  # parse the JSON body into Python objects
print(data)
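Note that json.loads raises json.JSONDecodeError if the body is not valid JSON, which is what would happen with an HTML page like the one shown in the question. A small defensive wrapper, sketched here under the hypothetical name load_json_or_none, makes that case explicit instead of crashing:

```python
import json


def load_json_or_none(text):
    """Parse text as JSON, returning None if it is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None
```

You would call it as data = load_json_or_none(r.text) and treat a None result as a sign that the server sent something other than JSON.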