I am new to Scrapy and am having trouble with this website: https://product.sanyglobal.com/excavator/mini_excavator/108/ If you scroll down to the middle of the page, you will see three types of equipment:
- 'SY16C'
- 'SY16 T4F'
- 'SY18C'.
I tried examining the HTML to find the URLs, and looked for a possible API request in the network tab, but found nothing. Can you please explain how they are hidden and, if possible, a way to scrape those links?
CodePudding user response:
The data you mention is populated by JavaScript, meaning the page is dynamic, and Scrapy does not render JavaScript on its own. You can use a browser automation tool such as Selenium to grab the data. If you want to keep things simple, try the following example, which combines Selenium with BeautifulSoup and pandas.
Example:
import time
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

webdriver_service = Service("./chromedriver")  # path to your chromedriver
driver = webdriver.Chrome(service=webdriver_service)
url = 'https://product.sanyglobal.com/excavator/mini_excavator/108/'
driver.get(url)
driver.maximize_window()
time.sleep(3)  # give the JavaScript time to render the table

soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()  # the browser is no longer needed once the HTML is captured

# Use only the first matching table
for table in soup.select('table.el-table__body')[0:1]:
    # Wrapping in StringIO avoids the literal-HTML deprecation warning in newer pandas
    df = pd.read_html(StringIO(str(table)))
    print(df[0])
Output:
                             0      1       2       3
0       SY16C Brochure Inquiry  1.88T  10.3kW  0.04m³
1  SY16C(T4f) Brochure Inquiry  1.83T  14.6kW  0.04m³
2  SY18C(T4f) Brochure Inquiry  1.96T  14.6kW  0.04m³
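A fixed time.sleep(3) can be too short on a slow connection and wasteful on a fast one. As a rough sketch of an alternative, the helper below waits explicitly until the table is present in the DOM. The function name fetch_rendered_html and its parameters are made up for illustration, and it assumes Selenium 4.6 or newer, which can usually locate a matching driver without an explicit path.

```python
def fetch_rendered_html(url, css_selector="table.el-table__body", timeout=10):
    """Load url in Chrome and return the page source once css_selector appears.

    Hypothetical helper; the selector default is taken from the example above.
    """
    # Imported lazily so this module still loads where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumes Selenium can find a suitable driver
    try:
        driver.get(url)
        # Block until the element exists, or raise TimeoutException after `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()
```

The returned HTML can then be passed to BeautifulSoup in the same way as driver.page_source in the example above.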
CodePudding user response:
The fix is to go back to the previous page, "https://product.sanyglobal.com/excavator/mini_excavator/". There, the href URLs are visible in the HTML, so they are easy to scrape without needing Selenium to render JavaScript.
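To illustrate the idea of reading hrefs straight from static HTML, here is a minimal sketch using only the Python standard library. The SAMPLE_HTML markup is invented for the demonstration and the second path is a placeholder; the real listing page's structure may differ, so inspect it and narrow the selection as needed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:  # skip anchors with a missing or empty href
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return absolute URLs for all links found in the given HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical markup standing in for the listing page (an assumption,
# not copied from the site); "example-id" is a placeholder path segment.
SAMPLE_HTML = """
<ul>
  <li><a href="/excavator/mini_excavator/108/">SY16C</a></li>
  <li><a href="/excavator/mini_excavator/example-id/">Another model</a></li>
</ul>
"""

links = extract_links(
    SAMPLE_HTML, "https://product.sanyglobal.com/excavator/mini_excavator/"
)
for link in links:
    print(link)
```

Inside a Scrapy spider's parse method, the same extraction would typically be written with response.css("a::attr(href)").getall() and response.urljoin(href), using a more specific selector chosen after inspecting the page.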