I'm just trying to scrape the titles from the page, but the HTML returned by page.inner_html('body') does not include all of the page's HTML. I think the missing part may be loaded by JavaScript, but when I look at the Network tab in dev tools I can't find a JSON response or any other request it could be coming from. I have tried this with Selenium as well, so there must be something I'm doing fundamentally wrong.
So no items from the list appear, but the rest of the HTML shows up fine. No amount of waiting for the content to load makes the items appear.
# import Playwright
from playwright.sync_api import sync_playwright

url = 'https://order.mandarake.co.jp/order/listPage/list?categoryCode=07&keyword=naruto&lang=en'

# open the URL
with sync_playwright() as p:
    # launch a visible (non-headless) Chromium window
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    # navigate to the listing page
    page.goto(url)
    # wait until network activity has settled
    page.wait_for_load_state("networkidle")
    # get the HTML content of <body>
    html = page.inner_html("body")
    print(html)
    # close the browser
    browser.close()
CodePudding user response:
No, the page's content isn't loaded dynamically by JavaScript; it's entirely static HTML, so requests and BeautifulSoup are enough to get the titles.
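One way to check the "static or dynamic" question yourself is to look for the target element in the raw server response, before any JavaScript has run. The sketch below is only an illustration: the helper name `appears_in_static_html` and the two sample strings are made up for this example, and in real use `raw_html` would be something like `requests.get(url).text`.

```python
import re

# Matches class="..." or class='...' attributes in raw markup.
CLASS_ATTR = re.compile(r'class\s*=\s*["\']([^"\']*)["\']', re.IGNORECASE)


def appears_in_static_html(raw_html: str, css_class: str) -> bool:
    """Return True if any class attribute in raw_html contains css_class.

    Hypothetical helper: a rough regex check, not a full HTML parser.
    """
    return any(
        css_class in match.group(1).split()
        for match in CLASS_ATTR.finditer(raw_html)
    )


# Made-up samples: one server-rendered, one that relies on a script bundle.
static_sample = '<div class="title"><a href="#">NARUTO</a></div>'
js_sample = '<div id="app"></div><script src="bundle.js"></script>'

print(appears_in_static_html(static_sample, "title"))  # True
print(appears_in_static_html(js_sample, "title"))      # False
```

If the class shows up in the raw response, a plain HTTP fetch plus an HTML parser is sufficient and no browser automation is needed.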
from bs4 import BeautifulSoup
import requests

page = requests.get('https://order.mandarake.co.jp/order/listPage/list?categoryCode=07&keyword=naruto&lang=en')
soup = BeautifulSoup(page.content, 'lxml')

data = []
for e in soup.select('div.title'):
    d = {
        'title': e.a.get_text(strip=True),
    }
    data.append(d)
print(data)
Output:
[{'title': 'NARUTO THE ANIMATION CHRONICLE\u3000genga made for sale'}, {'title': 'Plex DPCF Haruno Sakura Reboru ring of the eyes'}, {'title': 'Naruto: Shippuden\u3000(replica) ナルト'}, {'title': 'Naruto: Shippuden\u3000(replica) ナルト'}, {'title': 'Naruto: Shippuden\u3000(replica) NARUTO -ナルト-'}, {'title': 'Naruto: Shippuden ナルト\u3000(replica)'}, {'title': 'Naruto Shippuuden\u3000(replica) NARUTO -ナルト-'}, {'title': 'NARUTO -ナルト- 疾風伝\u3000(複製セル)'}, {'title': 'MegaHouse ちみ メガ Petit Chara Land NARUTO SHIPPUDEN ナルト blast-of-wind intermediary Even [swirl ナルト special is a volume on ばよ.
All 6 types set] inner bag not opened/box damaged'}, {'title': 'NARUTO -ナルト- 疾風伝\u3000(複製セル)'}, {'title': 'NARUTO -ナルト- 疾風伝\u3000(複製セル)'}, {'title': 'NARUTO -ナルト- 疾風伝'}, {'title': 'NARUTO -ナルト- 疾風伝'}, {'title': 'NARUTO -ナルト-'}]
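The titles above still contain `\u3000` (the ideographic space), and one of them spans a line break. If you want tidy single-line titles, a minimal sketch is to collapse every run of whitespace into one ordinary space. The name `clean_title` is just an illustration, not part of requests or BeautifulSoup, and the two sample strings are shortened from the output above.

```python
def clean_title(title: str) -> str:
    """Collapse whitespace runs (including U+3000 and newlines) to one space.

    str.split() with no arguments splits on any Unicode whitespace,
    so the ideographic space is handled as well.
    """
    return " ".join(title.split())


samples = [
    "Naruto: Shippuden\u3000(replica) ナルト",
    "special is a volume on ばよ.\nAll 6 types set]",
]

for raw in samples:
    print(clean_title(raw))
```

With the scraping loop above, this could be applied as `clean_title(e.a.get_text(strip=True))` when building each dictionary.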