Verifying Bs4 Parsing Output from a Website

Time:10-25

I was trying to scrape this site and ran into errors because tags that I expected to exist were missing from the HTML parsed by Bs4.

Site: https://en.thejypshop.com/category/cdlp/59/

I manually verified that the output parsed by Bs4 gives a completely different view of the HTML than what I see when I inspect the site itself. Here is a comparison of the two (the relevant HTML is copied in the two pastebin links). I also tried different parsers such as 'lxml' and 'html.parser', but to no avail.

(Bs4 Output): https://pastebin.com/tg4P5DFh

<div >
    <div >
      <a href="/product/stray-kids-mini-album-maxident-case-ver/842/category/59/display/2/" name="anchorBoxName_842">
        <img alt="" id="eListPrdImage842_2" src="https://cafe24img.poxo.com/jyp3602022/web/product/medium/202210/ca01b08c39232296f482b657be53aa4b.jpg" />
      </a>
      <span >
        <img alt="Before add to wish list" categoryno="59"  icon_status="off" individual-set="F" login_status="F" productno="842" src="/web/upload/icon_202204271744355800.png" />
      </span>
    </div>
    <div >
      <div ></div>
      <div >
        <div ></div>
        <img alt="Add to cart"  onclick="category_add_basket('842','59', '2', 'A0000', false, '1', 'P0000BGK', 'B', 'T', '20');" src="/web/upload/icon_202204271744303700.png" />
        <img alt="View larger image" onclick="zoom('842', '59', '2','', '');" src="//img.echosting.cafe24.com/design/skin/admin/en_US/btn_prd_zoom.gif" style="cursor:pointer" />
      </div>
    </div>
  </div>

(html from Site): https://pastebin.com/2xfi4XTA

<div >
      <div >
        <a href="/product/stray-kids-mini-album-maxident-case-ver/842/category/59/display/1/">
          <img src="https://cafe24img.poxo.com/jyp3602022/web/product/medium/202210/ca01b08c39232296f482b657be53aa4b.jpg" id="eListPrdImage842_1" alt="">
        </a>
      </div>
      <span >
        <img src="/web/upload/icon_202204271744355800.png"  alt="Before add to wish list" productno="842" categoryno="59" icon_status="off" login_status="F" individual-set="F">
        <img src="/web/upload/icon_202204271744303700.png" onclick="category_add_basket('842','59', '1', 'A0000', false, '1', 'P0000BGK', 'B', 'T', '20');" alt="Add to cart" >
      </span>
      <span ></span>
    </div>

Note that the <span ></span> tag does not appear in what Bs4 sees, among other things.

My guesses as to why this is the case:

  • I am not using a headless browser, so some websites such as this one might not serve or display the same content.
  • There is some JS running in the background whose changes to the page Bs4 does not pick up.

Please let me know if any of my guesses are incorrect and what is actually going on!
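
One quick way to test the second guess is to count how often each tag occurs in the HTML that Bs4 received versus the HTML copied from the browser's inspector. The following is only a minimal sketch, assuming bs4 is installed; the helper names `count_tags` and `missing_in_raw` and the two sample strings are hypothetical stand-ins, not taken from the site.

```python
from bs4 import BeautifulSoup


def count_tags(html, names):
    """Return {tag_name: number of occurrences} for each tag name in `names`."""
    soup = BeautifulSoup(html, "html.parser")
    return {name: len(soup.find_all(name)) for name in names}


def missing_in_raw(raw_html, rendered_html, names):
    """List tag names that occur more often in the rendered HTML than in the raw HTML."""
    raw = count_tags(raw_html, names)
    rendered = count_tags(rendered_html, names)
    return [name for name in names if rendered[name] > raw[name]]


# Hypothetical stand-ins for the fetched HTML and the inspector's HTML.
raw_sample = '<div><a href="/p/1"><img src="a.jpg"></a></div>'
rendered_sample = '<div><a href="/p/1"><img src="a.jpg"></a><span></span></div>'

print(missing_in_raw(raw_sample, rendered_sample, ["div", "a", "span"]))
```

In practice `raw_html` would be the response body the scraper fetched and `rendered_html` the markup copied from the browser's developer tools. Tags that show up only in the rendered copy were most likely added after the initial response, for example by JavaScript.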

CodePudding user response:

Yes, you are right: the page is built dynamically, so you can't get the final HTML with bs4 alone. Try combining Selenium with bs4 to get what you need. Here is a small script that finds some of the hidden elements and prints them. To go further, you would need to simulate browsing and capture the HTML once the page has fully rendered; the script below is still a work in progress.

import time

from bs4 import BeautifulSoup
from selenium import webdriver

# Run Chrome without a visible window
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

urls = ['https://en.thejypshop.com/category/cdlp/59/', 'https://pastebin.com/2xfi4XTA']
try:
    for url in urls:
        driver.get(url)  # returns None, so there is nothing to assign
        time.sleep(1)  # crude pause to give the page's scripts time to run
        pg_html = driver.page_source
        # Unescape angle brackets so HTML shown as text (the pastebin paste) is parsed as markup
        pg_html = pg_html.replace('&lt;', '<').replace('&gt;', '>')
        soup = BeautifulSoup(pg_html, 'html.parser')
        dv = soup.find_all('div', attrs={'class': 'thumbnail'})
        dv1 = soup.find_all('span', attrs={'class': 'soldout_icon'})

        # Print the first match of each kind, if any, between separator lines
        print(60 * '-')
        if dv:
            print(dv[0])
        print(60 * '-')
        if dv1:
            print(dv1[0])
            print(60 * '-')
finally:
    driver.quit()  # always close the browser, even if something above fails
''' Result:
------------------------------------------------------------
<div >
<div >
<a href="/product/stray-kids-mini-album-maxident-case-ver/842/category/59/display/2/" name="anchorBoxName_842"><img alt="" id="eListPrdImage842_2" src="https://cafe24img.poxo.com/jyp3602022/web/product/medium/202210/ca01b08c39232296f482b657be53aa4b.jpg"/></a>
<span ><img alt="Before add to wish list" categoryno="59"  icon_status="off" individual-set="F" login_status="F" productno="842" src="/web/upload/icon_202204271744355800.png"/></span>
</div>
<div >
<div > </div>
<div >
<div ></div> <img alt="Add to cart"  onclick="category_add_basket('842','59', '2', 'A0000', false, '1', 'P0000BGK', 'B', 'T', '20');" src="/web/upload/icon_202204271744303700.png"/> <img alt="View larger image" onclick="zoom('842', '59', '2','', '');" src="//img.echosting.cafe24.com/design/skin/admin/en_US/btn_prd_zoom.gif" style="cursor:pointer"/> </div>
</div>
</div>
------------------------------------------------------------
<span ></span>
------------------------------------------------------------
------------------------------------------------------------
<div >
</div>
------------------------------------------------------------
<span ></span>
------------------------------------------------------------
'''
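
The fixed `time.sleep(1)` in the script may be too short on a slow connection and wastefully long on a fast one. A possible alternative is to poll the page source until the element of interest appears. This is only a sketch of a hand-rolled polling helper: `wait_for_html` and its parameters are hypothetical names, and the `snapshots` iterator merely stands in for a page that finishes rendering on the third read.

```python
import time


def wait_for_html(get_html, is_ready, timeout=10.0, interval=0.5):
    """Poll get_html() until is_ready(html) is true, then return that HTML.

    Raises TimeoutError if the page is still not ready after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        html = get_html()
        if is_ready(html):
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError("page did not reach the expected state in time")
        time.sleep(interval)


# Stand-in for a page whose markup is complete only on the third read.
snapshots = iter([
    "<div></div>",
    "<div></div>",
    '<div><span class="soldout_icon"></span></div>',
])
html = wait_for_html(
    lambda: next(snapshots),
    lambda h: "soldout_icon" in h,
    timeout=5,
    interval=0.01,
)
print(html)
```

With Selenium, `get_html` would be something like `lambda: driver.page_source`. Selenium also provides `WebDriverWait` in `selenium.webdriver.support.ui`, which serves the same purpose and is probably the better choice in a real script.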

Regards...
