Website's URLs are not found or hidden, thus unable to scrape

Time:08-22

I am new to Scrapy, and my issue is with this website: https://product.sanyglobal.com/excavator/mini_excavator/108/ If you scroll down to the middle of the page, you will see three types of equipment:

  • 'SY16C'
  • 'SY16 T4F'
  • 'SY18C'.

I tried examining the HTML code to find the URLs, and looked for a possible API request in the network tab, but I found neither. Can you please explain how they are hidden and, if possible, suggest a way to scrape those links?

CodePudding user response:

The data you mention is populated by JavaScript, meaning the page is dynamic, and Scrapy on its own can't render JavaScript. You can use a browser automation tool such as Selenium to grab the desired data. If you would like to keep things simple, you can try the following example, which combines Selenium with BeautifulSoup and pandas.

Example:

import time
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

webdriver_service = Service("./chromedriver")  # Path to your chromedriver binary
driver = webdriver.Chrome(service=webdriver_service)
url = 'https://product.sanyglobal.com/excavator/mini_excavator/108/'
driver.get(url)
driver.maximize_window()
time.sleep(3)  # Crude wait for the JavaScript-rendered table to appear

# Parse the rendered DOM, then close the browser
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()

# Only the first el-table__body table is needed
for table in soup.select('table.el-table__body')[0:1]:
    # Wrap in StringIO: newer pandas versions deprecate passing literal HTML strings
    df = pd.read_html(StringIO(str(table)))
    print(df[0])

Output:

                            0      1       2       3
0       SY16C Brochure  Inquiry  1.88T  10.3kW  0.04m³
1  SY16C(T4f) Brochure  Inquiry  1.83T  14.6kW  0.04m³
2  SY18C(T4f) Brochure  Inquiry  1.96T  14.6kW  0.04m³
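
If you want to check for yourself that the model names are injected by JavaScript, one rough test is to download the page without a browser and see whether the names occur in the raw HTML at all. The sketch below uses only the standard library; `fetch_html` and `missing_from_static_html` are names made up for this illustration, and the User-Agent header is an assumption (some sites reject requests without one).

```python
from urllib.request import Request, urlopen


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download the raw HTML exactly as the server sends it (no JavaScript runs)."""
    # Assumption: a browser-like User-Agent avoids being rejected outright.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def missing_from_static_html(html: str, needles: list[str]) -> list[str]:
    """Return the needles that do not occur anywhere in the raw HTML text."""
    return [needle for needle in needles if needle not in html]
```

Passing the result of `fetch_html(url)` together with `["SY16C", "SY18C"]` to `missing_from_static_html` returns the names that are absent from the static response. Any name that is visible in the browser but missing here is most likely rendered client-side.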

CodePudding user response:

The solution is to go back to the previous page, "https://product.sanyglobal.com/excavator/mini_excavator/". There the href URLs are visible in the HTML code, so they are easy to scrape without needing Selenium to render JavaScript.
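
As a minimal sketch of that idea, the snippet below collects every href from a piece of HTML and resolves it against the page URL, using only the standard library. `LinkCollector` and `extract_links` are hypothetical helper names, and the sample markup is invented for illustration; the real page's structure may differ, so you would probably need to filter the links further.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from every <a href="..."> in a document."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative paths against the page URL.
            self.links.append(urljoin(self.base_url, href))


def extract_links(html: str, base_url: str) -> list[str]:
    """Return all anchor targets in html as absolute URLs."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical markup, not copied from the real site.
sample = (
    '<ul><li><a href="/excavator/mini_excavator/108/">SY16C</a></li>'
    "<li><a>no link</a></li></ul>"
)
base = "https://product.sanyglobal.com/excavator/mini_excavator/"
print(extract_links(sample, base))
```

In a Scrapy spider the same step would typically be `response.css("a::attr(href)").getall()`, with `response.urljoin(href)` to make each link absolute.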
