Home > Enterprise >  Why is the html content I got from inspector different from what I got from BeautifulSoup?
Why is the html content I got from inspector different from what I got from BeautifulSoup?

Time:12-17

Here is the site I am trying to scrap data from: [https://www.onestopwineshop.com/collection/type/red-wines][1]

import requests
from bs4 import BeautifulSoup
url = "https://www.onestopwineshop.com/collection/type/red-wines"
response = requests.get(url)
#print(response.text)
soup = BeautifulSoup(response.content,'lxml')

The code I have above.

It seems like the HTML content I got from the inspector is different from what I got from BeautifulSoup. My guess is that they are preventing me from getting their data as they detected I am not accessing the site with a browser. If so, is there any way to bypass that?

CodePudding user response:

Any website with content that is loaded after the inital page load is unavailable with BS4 with your current method. This is because the content will be loaded with an AJAX call via javascript and the requests library is unable to parse and run JS code.

To achieve this you will have to look at something like selenium which controls a browser via python or other languages... There is a seperate version of selenium for each browser i.e firefox, chrome etc.

Personally I use chrome so the drivers can be found here...

enter image description here

You don't need selenium or beautifulsoup at all, you can just use requests and parse the json.

  • Related