Cannot find a div embedded in another div using BeautifulSoup 4

Time:10-23

I am new to BeautifulSoup.

Here is my current code:

import requests, json
from bs4 import BeautifulSoup

headers = {'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'}

s = requests.Session()
res = s.get("https://www.myntra.com/jordan", headers=headers, verify=False)

src = res.content
soup = BeautifulSoup(src, 'lxml')
links = soup.find_all("a")
urls = []

for div in soup.find_all("div", attrs={'id':"mountRoot"}):
    print(div)
    print("\n")
    for div_tag in div.find_all('div'):
        print(div_tag)
        embedded_div = div_tag.find('div')
        print(embedded_div)
    

Output of this code:

<div id="mountRoot" style="min-height:750px;margin-top:-2px"><div class="loader-container"><div class="spinner-spinner"></div></div></div>

<div class="loader-container"><div class="spinner-spinner"></div></div>
<div class="spinner-spinner"></div>
<div class="spinner-spinner"></div>

Here is the inspect-element view of the website that I am looking at: https://i.stack.imgur.com/zui3R.png

To me, it seems that it is ignoring the <div data-reactroot> element.

What am I doing wrong? Any help would be appreciated.

CodePudding user response:

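The product list appears to be rendered in the browser by JavaScript, which is why the HTML returned by requests only contains the loader placeholder inside mountRoot. However, it seems the first rows are embedded in the page in a script tag with the attribute type="application/ld+json", like this:

<script type="application/ld+json">{ some big json here }</script>

You can get the data by selecting the JSON block whose "@type" key is "ItemList" and then reading its items: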

import requests, json
from bs4 import BeautifulSoup

headers = {'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'}

s = requests.Session()
res = s.get("https://www.myntra.com/jordan", headers=headers)
soup = BeautifulSoup(res.content, 'html.parser')

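# Parse every JSON-LD block embedded in the page
data_json = [
    json.loads(t.text)
    for t in soup.find_all("script", {"type": "application/ld+json"})
]
# Keep only the block that describes the product list
data = [
    t
    for t in data_json
    if isinstance(t, dict) and t.get("@type") == "ItemList"
]
print(data[0]["itemListElement"])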
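
As a rough sketch of what to do with the selected block: schema.org describes each entry of an ItemList as a ListItem, usually carrying position and url fields. Whether this site's JSON uses exactly those keys is an assumption, and both the helper name item_urls and the sample below are made up for illustration, so inspect data[0] before relying on it.

```python
def item_urls(item_list):
    """Return the `url` of each entry in a JSON-LD ItemList dict.

    Assumes schema.org-style ListItem entries; entries without a
    `url` key are skipped instead of raising KeyError.
    """
    entries = item_list.get("itemListElement", [])
    # Sort by `position` when present so the order matches the page
    entries = sorted(entries, key=lambda e: e.get("position", 0))
    return [e["url"] for e in entries if "url" in e]


# Hypothetical sample, not real data from the site
sample = {
    "@type": "ItemList",
    "itemListElement": [
        {"@type": "ListItem", "position": 2, "url": "https://example.com/b"},
        {"@type": "ListItem", "position": 1, "url": "https://example.com/a"},
        {"@type": "ListItem", "position": 3},
    ],
}
print(item_urls(sample))
```

With the real page you would call item_urls(data[0]) instead of passing the sample.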

The embedded JSON only contains the first few rows. To get the rest of the data with pagination, there is an API at:

GET https://www.myntra.com/gateway/v2/search/jordan

The following will get the first page using the API:

import requests

headers = {'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'}

s = requests.Session()
s.get("https://www.myntra.com/jordan", headers=headers)

# first page
r = s.get("https://www.myntra.com/gateway/v2/search/jordan",
    params = {
        "p": "1",
        "rows": 50,
        "o": 0,
        "plaEnabled":"false"
    },
    headers=headers
)
print(r.json())

You will need to increment p to move to the next page. Also, o is the offset index; increment it by rows - 1 (the page size minus one) each time. For example, with "rows": 50 the second page will have "o": 49.
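
The paging rule above can be sketched as a small helper. The function names page_params and fetch_pages are hypothetical, and the offset formula o = (page - 1) * (rows - 1) is simply the rule stated above written out; it has not been checked against the API beyond the second page. fetch_pages expects a requests.Session that has already visited the listing page, as in the earlier snippet.

```python
def page_params(page, rows=50):
    """Build query parameters for a 1-based page number.

    Assumption: the offset grows by rows - 1 per page,
    so page 2 with rows=50 gives o=49.
    """
    if page < 1:
        raise ValueError("page must be >= 1")
    return {
        "p": page,
        "rows": rows,
        "o": (page - 1) * (rows - 1),
        "plaEnabled": "false",
    }


def fetch_pages(session, headers, last_page, rows=50):
    """Yield the decoded JSON of pages 1..last_page.

    `session` is any object with a requests-style get() method,
    typically a requests.Session with cookies already set.
    """
    url = "https://www.myntra.com/gateway/v2/search/jordan"
    for page in range(1, last_page + 1):
        r = session.get(url, params=page_params(page, rows), headers=headers)
        r.raise_for_status()  # stop early on an HTTP error
        yield r.json()
```

For example, `for page in fetch_pages(s, headers, 3): print(page)` would request the first three pages using the session s and headers defined earlier.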
