I am new to BeautifulSoup.
Here is my current code:
import requests, json
from bs4 import BeautifulSoup

headers = {'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'}
s = requests.Session()
res = s.get("https://www.myntra.com/jordan", headers=headers, verify=False)
src = res.content
soup = BeautifulSoup(src, 'lxml')
links = soup.find_all("a")
urls = []
for div in soup.find_all("div", attrs={'id':"mountRoot"}):
    print(div)
    print("\n")
    for div_tag in div.find_all('div'):
        print(div_tag)
        embedded_div = div_tag.find('div')
        print(embedded_div)
Output of this code:
<div id="mountRoot" style="min-height:750px;margin-top:-2px"><div class="loader-container"><div class="spinner-spinner"></div></div></div>
<div class="loader-container"><div class="spinner-spinner"></div></div>
<div class="spinner-spinner"></div>
<div class="spinner-spinner"></div>
Here is the inspect element view of the website that I am looking at: https://i.stack.imgur.com/zui3R.png
To me, it seems that it is ignoring the <div data-reactroot>.
What am I doing wrong? Any help would be appreciated.
CodePudding user response:
It seems the first rows are embedded in the page in a script tag with the attribute type="application/ld+json", like this:
<script type="application/ld+json">{ some big json here }</script>
You can get the data by selecting the JSON object whose @type key is "ItemList" and then reading its items:
import requests, json
from bs4 import BeautifulSoup

headers = {'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'}
s = requests.Session()
res = s.get("https://www.myntra.com/jordan", headers=headers)
soup = BeautifulSoup(res.content, 'html.parser')

# parse every JSON-LD script block embedded in the page
data_json = [
    json.loads(t.text)
    for t in soup.find_all("script", {"type": "application/ld+json"})
]
# keep only the block that describes the product list
data = [
    t
    for t in data_json
    if t.get("@type") == "ItemList"
]
print(data[0]["itemListElement"])
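The filtering step above can be made a little more defensive. The following is only a sketch, not part of the original answer: find_item_lists is a hypothetical helper name, the sample strings are made up, and it assumes a JSON-LD block may be malformed or may contain a list of objects rather than a single one.

```python
import json


def find_item_lists(raw_blocks):
    """Return every JSON-LD object whose @type is "ItemList".

    raw_blocks is an iterable of strings, one per script tag body.
    """
    item_lists = []
    for raw in raw_blocks:
        try:
            parsed = json.loads(raw)
        except (TypeError, ValueError):
            # Skip empty or malformed blocks instead of aborting the scrape.
            continue
        # A block may hold a single object or a list of objects.
        candidates = parsed if isinstance(parsed, list) else [parsed]
        for obj in candidates:
            if isinstance(obj, dict) and obj.get("@type") == "ItemList":
                item_lists.append(obj)
    return item_lists


# Made-up sample data standing in for the real page content.
sample_blocks = [
    '{"@type": "Organization", "name": "Example"}',
    '{"@type": "ItemList", "itemListElement": [{"position": 1}, {"position": 2}]}',
    'not valid json',
]
found = find_item_lists(sample_blocks)
print(len(found[0]["itemListElement"]))
```

With the soup object from the answer, the input would be the text of each matching script tag, for example (t.text for t in soup.find_all("script", {"type": "application/ld+json"})).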
The embedded JSON only contains the first few rows. To get the data with pagination, there is an API at:
GET https://www.myntra.com/gateway/v2/search/jordan
The following gets the first page using the API:
import requests

headers = {'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'}
s = requests.Session()
s.get("https://www.myntra.com/jordan", headers=headers)

# first page
r = s.get(
    "https://www.myntra.com/gateway/v2/search/jordan",
    params={
        "p": "1",
        "rows": 50,
        "o": 0,
        "plaEnabled": "false"
    },
    headers=headers
)
print(r.json())
To move to the next page, increment p. o is the offset index; increment it by per_page - 1 each time, where per_page is the value of rows. For example, the second page will have "o": 49 if you have set "rows": 50.
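The pagination rule described above can be sketched as a small helper. This is only an illustration: page_params is a hypothetical name, and the offset formula simply encodes the rule stated in this answer (o grows by rows - 1 per page), which has not been checked against the API beyond that.

```python
def page_params(page, rows=50):
    """Build the query parameters for one results page (pages start at 1)."""
    if page < 1 or rows < 1:
        raise ValueError("page and rows must be positive")
    return {
        "p": str(page),
        "rows": rows,
        # Offset rule taken from the answer: advance by rows - 1 per page.
        "o": (page - 1) * (rows - 1),
        "plaEnabled": "false",
    }


# Parameters for the first three pages.
first_three = [page_params(p) for p in range(1, 4)]
print([params["o"] for params in first_three])
```

Each dictionary can then be passed as the params argument of the s.get call shown earlier, for example s.get("https://www.myntra.com/gateway/v2/search/jordan", params=page_params(2), headers=headers).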