Difficulties web scraping: the URL doesn't change on search; it stays the same for every item searched


First, I have been thinking about a theoretical way to scrape this page.

https://www.mercadopublico.cl/Home is the Chilean government's public procurement site, where you can apply to deliver services to the state.


So I search for "Camas" ("beds" in Spanish).

The first barrier I have found is that the URL doesn't change at all with my search: https://www.mercadopublico.cl/Home/BusquedaLicitacion stays the same for any search.


The second barrier: the URL won't change either when I move to the next page, so I can't iterate over an array of changing URLs as I would like to do.

The third barrier is that most of the information I want is in a pop-up window opened from the main page (which itself doesn't change).


There, the information could be downloaded as a CSV or JSON, or scraped directly from the pop-up window.

But so far I haven't found a solution for the part where the URL doesn't change when I change the search or the page, so I haven't been able to think further, because I can't get the first part done.

I think scraping the pop-up would be the easiest part, because at that point I already have a URL (the pop-up window does have a different URL!).

If you know how, or if I need another methodology (so far I have only been using BS4), please let me know which direction I should take.

Here is my first error, which I don't know how to solve with the usual code; if you help me with that I can go further. The problem is changing the URL to build the list of page URLs, because I can't use the range method.

 # -*- coding: utf-8 -*-
"""
Spyder Editor

This is a temporary script file.
"""

import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.mercadopublico.cl/Home/BusquedaLicitacion'

# problem here: I can't navigate because the page loads results via AJAX
params = {
    'page': 0,
    'page1': 40,
}

results = []

for offset in range(0, 121, 40):  # this approach doesn't work on an AJAX-driven page

    params['start'] = offset

    response = requests.get(url, params=params)
    print('url:', response.url)
    #print('status:', response.status_code)
                    
    soup = bs(response.text, "html.parser")

    all_products = soup.find_all('div', {'class': 'product-tile'})

    for product in all_products:
        itemid = product.get('data-itemid') 
        print('itemid:', itemid)

        data = product.get('data-product') 
        print('data:', data)
        
        name = product.find('span', {'itemprop': 'name'}).text
        print('name:', name)
        
        all_prices = product.find_all('div', {'class': 'price__text'})
        print('len(all_prices):', len(all_prices))
        
        price = all_prices[0].get('aria-label')
        print('price:', price)
        
        results.append( (itemid, name, price, data) )
        print('results')

# ---

# ... here you can save all `results` to a file ...
import pandas as pd
df = pd.DataFrame(data=results[1:], columns=results[0])
df.to_excel('results.xlsx', index=False, header=False)  # write to an Excel file

So, I was just now trying to get the URLs with this modification of the code:

import requests
from bs4 import BeautifulSoup as bs    
from selenium import webdriver

# set chromedriver.exe path
driver = webdriver.Chrome(executable_path="C:\\chromedriver.exe")
#implicit wait
driver.implicitly_wait(0.5)
#maximize browser
driver.maximize_window()
#launch URL
driver.get('https://www.mercadopublico.cl/Home/BusquedaLicitacion')
#identify element
l =driver.find_element_by_xpath("//button[text()='Check it Now']")
#perform click
driver.execute_script("arguments[0].click();", l)

    
url = 'https://www.mercadopublico.cl/Home/BusquedaLicitacion'
    
response = requests.get(url)
print('url:', response.url)
#print('status:', response.status_code)
                        
soup = bs(response.text, "html.parser")
    
all_products = soup.find_all('a', {'href': '#'})
    
for product in all_products:
    itemurl = product.get('onclick') 
    print('itemurl:', itemurl)  # up to here

#close browser
driver.quit()

But it didn't print anything, and I'm not sure what failed.

Thanks very much.

CodePudding user response:

The URL doesn't change because the page is making a POST request with the search query.

POST https://www.mercadopublico.cl/BuscarLicitacion/Home/Buscar

And the request data is:

{
  "textoBusqueda":"camas",
  "idEstado":"5",
  "codigoRegion":"-1",
  "idTipoLicitacion":"-1",
  "fechaInicio":null,
  "fechaFin":null,
  "registrosPorPagina":"10",
  "idTipoFecha":[],
  "idOrden":"1",
  "compradores":[],
  "garantias":null,
  "rubros":[],
  "proveedores":[],
  "montoEstimadoTipo":[0],
  "esPublicoMontoEstimado":null,
  "pagina":0
}

There is also a cookie, __RequestVerificationToken_L0hvbWU1, that may be needed.
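
For the pagination barrier specifically, you can keep the payload fixed and just increment "pagina". Here is a minimal sketch; the endpoint and field names come from the captured request above (the null/empty fields are omitted on the assumption that they are optional, and whether the endpoint wants form-encoded data, as in the full example below, or a JSON body is worth verifying in the browser's network tab):

import requests

SEARCH_URL = "https://www.mercadopublico.cl/BuscarLicitacion/Home/Buscar"

payload = {
    "textoBusqueda": "camas",
    "idEstado": "5",
    "codigoRegion": "-1",
    "idTipoLicitacion": "-1",
    "registrosPorPagina": "10",
    "idOrden": "1",
    "montoEstimadoTipo": [0],
    "pagina": 0,
}

with requests.Session() as s:
    # hitting the home page first makes the server set the session cookies,
    # including __RequestVerificationToken_L0hvbWU1
    s.get("https://www.mercadopublico.cl/Home")
    for page in range(3):  # first three result pages
        payload["pagina"] = page
        res = s.post(SEARCH_URL, data=payload)
        print("page", page, "->", res.status_code, len(res.text), "bytes")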

Then you can get the link to the pop-up from the HTML; it's inside the onclick attribute of the link.

If you need more help, just ask in the comment section.

Python example: I've currently got it working up to the final step. When I looked at the CSV and JSON files, I realized they are both invalid; the site seems to append some HTML at the bottom of both. I would recommend just scraping the data from the final page rather than downloading the CSV/JSON.

import requests
from bs4 import BeautifulSoup


def get_headers(session):
    # visiting the home page makes the server set the session cookies,
    # including the __RequestVerificationToken_L0hvbWU1 token
    res = session.get("https://www.mercadopublico.cl/Home")
    if res.status_code == 200:
        print("Got headers")
    else:
        print("Failed to get headers")



def search(session):
    data = {
        "textoBusqueda": "Camas",
        "idEstado": "5",
        "codigoRegion": "-1",
        "idTipoLicitacion": "-1",
        "fechaInicio": None,
        "fechaFin": None,
        "registrosPorPagina": "10",
        "idTipoFecha": [],
        "idOrden": "1",
        "compradores": [],
        "garantias": None,
        "rubros": [],
        "proveedores": [],
        "montoEstimadoTipo": [0],
        "esPublicoMontoEstimado": None,
        "pagina": 0
    }
    res = session.post(
        "https://www.mercadopublico.cl/BuscarLicitacion/Home/Buscar",
        data=data)
    if res.status_code == 200:
        print("Search succeeded")
        return res.text
    else:
        print("Search failed with error:", res.reason)



def get_popup_link(html):
    soup = BeautifulSoup(html, "html.parser")
    dirty_links = [link["onclick"] for link in soup.select(".lic-block-body a")]
    # strip the $.Busqueda.verFicha('...') wrapper to leave the bare pop-up URL
    clean_links = [link.replace("$.Busqueda.verFicha('", "").replace("')", "") for link in dirty_links]
    return clean_links


def get_download_html(s, links):
    # fetch the pop-up pages and return the HTML of the first one that succeeds
    for link in links:
        res = s.get(link)
        if res.status_code == 200:
            print("fetch succeeded")
            return res.text
        else:
            print("fetch failed with error:", res.reason)

# note: currently identical to get_popup_link; the selectors would need to be
# adapted to the pop-up page's markup before this is useful
def get_download_links(html):
    soup = BeautifulSoup(html, "html.parser")
    dirty_links = [link["onclick"] for link in soup.select(".lic-block-body a")]
    # clean onclick links
    clean_links = [link.replace("$.Busqueda.verFicha('", "").replace("')", "") for link in dirty_links]
    return clean_links

def main():
    with requests.Session() as s:
        get_headers(s)
        html = search(s)
        popup_links = get_popup_link(html)
        print(popup_links)
        download_html = get_download_html(s, popup_links)
        # print(download_html)

if __name__ == "__main__":
    main()
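
About the invalid CSV/JSON downloads mentioned above: if you do want the JSON export, one possible workaround is to decode only the leading JSON value and ignore whatever HTML the site appends after it. A sketch with the standard library, assuming the body really is valid JSON followed by trailing HTML:

import json

def parse_json_with_trailer(text):
    # raw_decode parses a single JSON value from the start of the string and
    # returns it with the index where parsing stopped, so any HTML appended
    # after the JSON is simply ignored
    obj, end = json.JSONDecoder().raw_decode(text.lstrip())
    return obj

# hypothetical usage: data = parse_json_with_trailer(res.text)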