The Situation:
I am trying to scrape an advertisements website using Scrapy. I can access and extract the information from the tags that are relevant to me; however, the "price" tag is reluctant to provide its text.
What I have tried:
For the price tag, I have tried different approaches, i.e. a CSS selector and an XPath expression, but neither worked.
response.css('span.ma-AdPrice-value.ma-AdPrice-value--default.ma-AdPrice-value--heading--l::text').get()
response.xpath("//span[@class='ma-AdPrice-value ma-AdPrice-value--default ma-AdPrice-value--heading--l']/text()").get()
I tried different response calls in the terminal with scrapy shell, including extracting the data at higher tag levels, without success.
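For example, a broad check along these lines in the shell (an illustrative sketch, not my exact commands) shows whether the price markup is present in the downloaded HTML at all:
>>> 'ma-AdPrice-value' in response.text
>>> response.xpath("//*[contains(@class, 'ma-AdPrice')]").getall()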
The Code:
Below is an extract of the code. The title tag (and others) work fine; the price tag is the problem here.
import scrapy


class AdSpider(scrapy.Spider):
    name = 'ads'
    start_urls = [
        'https://www.milanuncios.com/volkswagen-de-segunda-mano/volkswagen-golf-2-0-tdi-184cv-dsg-gtd-bmt-409202986.htm'
    ]

    def parse(self, response):
        title = response.css('h1.ma-AdDetail-title.ma-AdDetail-title-size-heading-m::text').get()
        price = response.css('span.ma-AdPrice-value.ma-AdPrice-value--default.ma-AdPrice-value--heading--l::text').get()

        items = {}  # container for the scraped fields
        items['title'] = title
        items['price'] = price
        yield items
The Question: I have no experience in web scraping, so is there any hidden feature in the website's HTML that I missed? How could I solve this issue?
CodePudding user response:
The issue here seems to be that the page requires JavaScript.
When running a test with requests and BeautifulSoup:
>>> import requests
>>> from bs4 import BeautifulSoup as bs
>>> res = requests.get("https://www.milanuncios.com/volkswagen-de-segunda-mano/volkswagen-golf-2-0-tdi-184cv-dsg-gtd-bmt-409202986.htm")
>>> soup = bs(res.text, "lxml")
>>> soup.find("span", text="Precio financiado")
The element was not found: the response instead contained an error requiring JavaScript to continue, along with bot protection.
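A rough way to see this directly from the same session (an illustrative check, not verbatim from the test above):
>>> "Precio financiado" in res.text
>>> res.text[:300]  # shows the JavaScript / bot-protection notice rather than the ad markup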
When using selenium to collect it:
>>> from selenium import webdriver
>>> from bs4 import BeautifulSoup as bs
>>> path = "path_to_executable"
>>> driver = webdriver.Firefox(executable_path=path)
>>> driver.get("https://www.milanuncios.com/volkswagen-de-segunda-mano/volkswagen-golf-2-0-tdi-184cv-dsg-gtd-bmt-409202986.htm")
>>> soup = bs(driver.page_source, "lxml")
>>> soup.find("span", text="Precio financiado").findNext("span").text
The expected price was returned.
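If the price is not in the page source right away, an explicit wait before grabbing driver.page_source is the usual refinement. A sketch (the 10-second timeout is an arbitrary choice):
>>> from selenium.webdriver.common.by import By
>>> from selenium.webdriver.support.ui import WebDriverWait
>>> from selenium.webdriver.support import expected_conditions as EC
>>> WebDriverWait(driver, 10).until(
...     EC.presence_of_element_located((By.CLASS_NAME, "ma-AdPrice-value")))
>>> soup = bs(driver.page_source, "lxml")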
To solve this, you either need to look into incorporating Selenium into Scrapy or use Selenium on its own to collect the data.
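As a minimal sketch of the first option (assuming geckodriver is available and reusing the BeautifulSoup parsing from above; this is one way to wire it up, not the only one), the spider can own its own driver and re-fetch each page with Selenium:

import scrapy
from bs4 import BeautifulSoup as bs
from selenium import webdriver


class AdSpider(scrapy.Spider):
    name = 'ads'
    start_urls = [
        'https://www.milanuncios.com/volkswagen-de-segunda-mano/volkswagen-golf-2-0-tdi-184cv-dsg-gtd-bmt-409202986.htm'
    ]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Assumes geckodriver is on PATH; otherwise configure the driver with its executable path.
        self.driver = webdriver.Firefox()

    def parse(self, response):
        # Re-fetch the page with Selenium so the JavaScript runs, then parse the rendered HTML.
        self.driver.get(response.url)
        soup = bs(self.driver.page_source, "lxml")
        title = soup.find("h1", class_="ma-AdDetail-title").get_text(strip=True)
        price = soup.find("span", text="Precio financiado").findNext("span").text
        yield {'title': title, 'price': price}

    def closed(self, reason):
        # Shut the browser down when the spider finishes.
        self.driver.quit()

If the site's bot protection also blocks the initial Scrapy request, driving the whole crawl from Selenium (the second option) avoids the extra fetch entirely.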