The Situation:
I am trying to scrape an advertisements website using Scrapy. I can access and extract the information from the tags that are relevant to me; however, the "price" tag is reluctant to provide its text.
What I have tried:
For the price tag, I have tried different approaches, i.e. a CSS selector and an XPath expression, but neither worked.
response.css('span.ma-AdPrice-value.ma-AdPrice-value--default.ma-AdPrice-value--heading--l::text').get()
response.xpath("//span[@class='ma-AdPrice-value ma-AdPrice-value--default ma-AdPrice-value--heading--l']/text()").get()
I tried different response calls in the terminal with scrapy shell, including extracting the data at higher tag levels, without success.
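For example, a broad check along these lines in the shell (an illustrative sketch, not my exact commands) shows whether the price markup is present in the downloaded HTML at all:
>>> 'ma-AdPrice-value' in response.text
>>> response.xpath("//*[contains(@class, 'ma-AdPrice')]").getall()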
The Code:
Below is an extract of the code. The title tag (and others) work fine; the price tag is the problem here.
import scrapy


class AdSpider(scrapy.Spider):
    name = 'ads'
    start_urls = [
        'https://www.milanuncios.com/volkswagen-de-segunda-mano/volkswagen-golf-2-0-tdi-184cv-dsg-gtd-bmt-409202986.htm'
    ]

    def parse(self, response):
        title = response.css('h1.ma-AdDetail-title.ma-AdDetail-title-size-heading-m::text').get()
        price = response.css('span.ma-AdPrice-value.ma-AdPrice-value--default.ma-AdPrice-value--heading--l::text').get()

        items = {}  # container for the scraped fields
        items['title'] = title
        items['price'] = price
        yield items
The Question: I have no experience in web scraping, so is there any hidden feature in the website's HTML that I missed? How could I solve this issue?
CodePudding user response:
The issue here seems to be that the page requires JavaScript.
When running a test with requests and BeautifulSoup:
>>> import requests
>>> from bs4 import BeautifulSoup as bs
>>> res = requests.get("https://www.milanuncios.com/volkswagen-de-segunda-mano/volkswagen-golf-2-0-tdi-184cv-dsg-gtd-bmt-409202986.htm")
>>> soup = bs(res.text, "lxml")
>>> soup.find("span", text="Precio financiado")
The element was not found: the response instead contained an error requiring JavaScript to continue, along with bot protection.
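A rough way to see this directly from the same session (an illustrative check, not verbatim from the test above):
>>> "Precio financiado" in res.text
>>> res.text[:300]  # shows the JavaScript / bot-protection notice rather than the ad markup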
When using selenium to collect it:
>>> from selenium import webdriver
>>> from bs4 import BeautifulSoup as bs
>>> path = "path_to_executable"
>>> driver = webdriver.Firefox(executable_path=path)
>>> driver.get("https://www.milanuncios.com/volkswagen-de-segunda-mano/volkswagen-golf-2-0-tdi-184cv-dsg-gtd-bmt-409202986.htm")
>>> soup = bs(driver.page_source, "lxml")
>>> soup.find("span", text="Precio financiado").findNext("span").text
The expected price was returned.
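If the price is not in the page source right away, an explicit wait before grabbing driver.page_source is the usual refinement. A sketch (the 10-second timeout is an arbitrary choice):
>>> from selenium.webdriver.common.by import By
>>> from selenium.webdriver.support.ui import WebDriverWait
>>> from selenium.webdriver.support import expected_conditions as EC
>>> WebDriverWait(driver, 10).until(
...     EC.presence_of_element_located((By.CLASS_NAME, "ma-AdPrice-value")))
>>> soup = bs(driver.page_source, "lxml")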
To solve this, you either need to look into incorporating Selenium into Scrapy or use Selenium on its own to collect the data.
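As a minimal sketch of the first option (assuming geckodriver is available and reusing the BeautifulSoup parsing from above; this is one way to wire it up, not the only one), the spider can own its own driver and re-fetch each page with Selenium:

import scrapy
from bs4 import BeautifulSoup as bs
from selenium import webdriver


class AdSpider(scrapy.Spider):
    name = 'ads'
    start_urls = [
        'https://www.milanuncios.com/volkswagen-de-segunda-mano/volkswagen-golf-2-0-tdi-184cv-dsg-gtd-bmt-409202986.htm'
    ]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Assumes geckodriver is on PATH; otherwise configure the driver with its executable path.
        self.driver = webdriver.Firefox()

    def parse(self, response):
        # Re-fetch the page with Selenium so the JavaScript runs, then parse the rendered HTML.
        self.driver.get(response.url)
        soup = bs(self.driver.page_source, "lxml")
        title = soup.find("h1", class_="ma-AdDetail-title").get_text(strip=True)
        price = soup.find("span", text="Precio financiado").findNext("span").text
        yield {'title': title, 'price': price}

    def closed(self, reason):
        # Shut the browser down when the spider finishes.
        self.driver.quit()

If the site's bot protection also blocks the initial Scrapy request, driving the whole crawl from Selenium (the second option) avoids the extra fetch entirely.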