Web Scraping: Correct content is not being returned using BeautifulSoup(page.content,'html.pars

Time:12-30

I am trying to scrape the AJIO website, but the content Python fetches is not the same as what I see when I inspect the page in the browser. It seems that JavaScript on the page builds the HTML after it loads, so when I fetch the page in Python I get the JavaScript code instead of the rendered HTML. Can anyone suggest a solution? The code I am using is below.

With the code below I get the error "TypeError: 'NoneType' object is not iterable" at the for loop. This happens because the page parsed by "BeautifulSoup(page.content,'html.parser')" does not contain the element I am looking for, so find() returns None. I can see the "preview" class when I inspect the page in the browser, but it is missing from the content that Python fetches.

import requests
from bs4 import BeautifulSoup

url="https://www.ajio.com/men-jeans/c/830216001?query=:relevance&gridColumns=5"
page=requests.get(url)
ajio=BeautifulSoup(page.content,'html.parser')
print(ajio.prettify())  # Problem: the printed HTML does not contain the "preview" class

jeans_list = ajio.find('script',attrs={'class':'preview'})
for jeans in jeans_list:
    print(jeans_list.prettify())

CodePudding user response:

To parse this site, you should extract the JSON object from the JavaScript code, then convert it to a Python dict and read the jeans data from it.

Your target looks like this:

        <script>
          window.__PRELOADED_STATE__ = {"wishlist":{}, 
    ....
          "apiStatusMessage":""}}};
        </script>

So you can grab it with a regex, parse it into a dict, and find where your data is stored.

Here is an example of how to find the products' names and prices:

import requests
import re
import json

url="https://www.ajio.com/men-jeans/c/830216001?query=:relevance&gridColumns=5"
page=requests.get(url)

# Lazily capture everything from the first "{" up to the closing "}}};"
m = re.search(r'window\.__PRELOADED_STATE__ = ({.*?}}});', page.text)

raw_json = m.group(1)
data_dict = json.loads(raw_json)


jeans_list = data_dict["grid"]["entities"].values()

for jeans in jeans_list:
    print(f"name: {jeans['name']}; price: {jeans['price']['value']}")
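
If the regex proves fragile (for example when the page layout changes and re.search returns None), one possible alternative is to let the standard-library JSON decoder find the end of the object itself. The sketch below is not the answer's method: it swaps the regex for json.JSONDecoder().raw_decode, and it runs offline against a made-up SAMPLE_HTML string rather than the live site. The helper name extract_preloaded_state and the sample products are hypothetical, and it assumes the real page keeps the same "grid" / "entities" layout used above.

```python
import json

# Text that precedes the JSON object inside the <script> tag.
MARKER = "window.__PRELOADED_STATE__ = "


def extract_preloaded_state(html):
    """Return the object assigned to window.__PRELOADED_STATE__, or None if absent."""
    start = html.find(MARKER)
    if start == -1:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (here the trailing ";" and "</script>").
    obj, _end = json.JSONDecoder().raw_decode(html, start + len(MARKER))
    return obj


# Hypothetical stand-in for page.text; the product data is invented.
SAMPLE_HTML = """
<html><body>
<script>
  window.__PRELOADED_STATE__ = {"grid": {"entities": {
    "1": {"name": "Slim Fit Jeans", "price": {"value": 999}},
    "2": {"name": "Skinny Jeans", "price": {"value": 1299}}
  }}};
</script>
</body></html>
"""

state = extract_preloaded_state(SAMPLE_HTML)
if state is None:
    # Guarding here avoids the kind of NoneType error seen in the question.
    print("preloaded state not found")
else:
    for jeans in state["grid"]["entities"].values():
        print(f"name: {jeans['name']}; price: {jeans['price']['value']}")
```

To try it on the real page, pass page.text to extract_preloaded_state instead of SAMPLE_HTML. Checking for None before iterating keeps a missing script tag from turning into a confusing TypeError later on.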