I am trying to scrape the AJIO website, but the content that Python fetches is not the same as what I see when I inspect the page in the browser. It seems that some JavaScript on the page builds the HTML, so when I fetch the page in Python I get the JavaScript code instead of the rendered HTML. Can anyone suggest a solution? Below is the code I am using.
With the code below I get the error "TypeError: 'NoneType' object is not iterable" at the for loop, which I think is because the page is not being fetched correctly through "ajio=BeautifulSoup(page.content,'html.parser')". I can see the "preview" class when I inspect the page in the browser, but I cannot find it in the HTML that Python fetches.
import requests
from bs4 import BeautifulSoup
url="https://www.ajio.com/men-jeans/c/830216001?query=:relevance&gridColumns=5"
page=requests.get(url)
ajio=BeautifulSoup(page.content,'html.parser')
print(ajio.prettify())
# Problem: find() returns None here, so the loop below raises the TypeError
jeans_list = ajio.find('script',attrs={'class':'preview'})
for jeans in jeans_list:
    print(jeans_list.prettify())
CodePudding user response:
To parse this site, you should extract the JSON object embedded in the page's JavaScript, then convert it to a Python dict and read the jeans data from it.
Your target looks like this:
<script>
window.__PRELOADED_STATE__ = {"wishlist":{},
....
"apiStatusMessage":""}}};
</script>
So you can grab it with a regex, parse it into a dict, and find the place where your data is stored.
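To find where the data is stored, it can help to search the parsed dict for a key you expect, such as "price". The helper below is only a sketch: find_key_paths and sample_state are hypothetical names, and the sample dict is a heavily shortened stand-in for the real state, not the actual AJIO payload.

```python
def find_key_paths(obj, target, path=()):
    """Yield every path (tuple of keys/indexes) where `target` occurs as a dict key."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key == target:
                yield path + (key,)
            # Keep descending, the same key may also occur deeper down
            yield from find_key_paths(value, target, path + (key,))
    elif isinstance(obj, list):
        for index, item in enumerate(obj):
            yield from find_key_paths(item, target, path + (index,))


# Hypothetical, shortened stand-in for the parsed state (assumption)
sample_state = {
    "wishlist": {},
    "grid": {
        "entities": {
            "1001": {"name": "Slim Fit Jeans", "price": {"value": 999}},
        }
    },
}

paths = list(find_key_paths(sample_state, "price"))
print(paths)
```

Each path it prints tells you which keys to index, in order, to reach the value.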
Here is an example of how to get the products' names and prices:
import requests
import re
import json
url="https://www.ajio.com/men-jeans/c/830216001?query=:relevance&gridColumns=5"
page=requests.get(url)
# Capture the JSON object assigned to window.__PRELOADED_STATE__
m = re.search(r'window\.__PRELOADED_STATE__ = ({.*?}}});', page.text)
raw_json = m.group(1)
data_dict = json.loads(raw_json)
# Product entries are stored under grid -> entities
jeans_list = data_dict["grid"]["entities"].values()
for jeans in jeans_list:
    print(f"name: {jeans['name']}; price: {jeans['price']['value']}")
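If the regex turns out to be fragile (for example, if the object no longer ends in exactly "}}};"), a possible alternative is to locate the assignment and let the json module find the end of the object with JSONDecoder.raw_decode. This is only a sketch: extract_preloaded_state is a hypothetical helper, and sample_html is a made-up, shortened page used so the example runs without a network request.

```python
import json
import re

# Matches the assignment, tolerating varying whitespace around "="
STATE_MARKER = re.compile(r"window\.__PRELOADED_STATE__\s*=\s*")


def extract_preloaded_state(html):
    """Return the parsed state object, or None if the marker is not in the page."""
    match = STATE_MARKER.search(html)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (here the trailing ";" and "</script>")
    state, _end = json.JSONDecoder().raw_decode(html, match.end())
    return state


# Made-up, shortened page content (assumption, not the real AJIO response)
sample_html = """<script>
window.__PRELOADED_STATE__ = {"wishlist":{},"grid":{"entities":{"1001":{"name":"Slim Fit Jeans","price":{"value":999}}}},"status":{"apiStatusMessage":""}};
</script>"""

state = extract_preloaded_state(sample_html)
for jeans in state["grid"]["entities"].values():
    print(f"name: {jeans['name']}; price: {jeans['price']['value']}")
```

With the real site you would pass page.text instead of sample_html and check for None before indexing, since the marker may be missing if the page layout changes or the request is blocked.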