I have been thinking first about a theoretical way to web-scrape this page:
https://www.mercadopublico.cl/Home is the Chilean government's open procurement site, where you can apply to deliver services to the state.
So I searched for "Camas" (meaning "beds" in Spanish).
The first barrier I have found is that the URL doesn't change at all with my search: https://www.mercadopublico.cl/Home/BusquedaLicitacion stays the same for any search.
The second barrier: the URL won't change either when I move to the next page, so I can't build an array of page URLs and loop over them as I would like to do (sketched below).
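To make it concrete, this is the kind of page-by-page URL loop I mean (a sketch with a made-up URL pattern, just to illustrate; it can't work here because the real URL never changes):

# The kind of URL-array pagination I would like to use; the URL pattern
# here is invented only to illustrate the idea.
urls = [f"https://example.com/search?q=camas&page={n}" for n in range(1, 6)]
for url in urls:
    print(url)  # each page would be requested and parsed here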
The third barrier is that most of the information I want is in a pop-up window opened from the main one (which, again, doesn't change).
From there the information can be downloaded as CSV or JSON, or scraped directly from the pop-up window.
But so far I haven't found a solution for the fact that the URL doesn't change when I change the search or the page, so I can't think any further because I can't get the first part done.
I think scraping the pop-up would be the easiest part, because at that point I already have a URL (the pop-up window does have a different URL!).
If you know how, or if I need a different methodology (so far I have only been using BS4 for this), please let me know which direction I should take.
Here is the first error I don't know how to solve with the usual code; if you help me with that, I can go further. It is about changing the URL to build the array of URLs, because I can't use the range method.
# -*- coding: utf-8 -*-
"""
Spyder Editor
This is a temporary script file.
"""
import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.mercadopublico.cl/Home/BusquedaLicitacion'
# problem here: I can't paginate, because the results are loaded via AJAX
params = {
    'page': 0,
    'page1': 40,
}

results = []
for offset in range(0, 121, 40):  # this approach doesn't work on an AJAX page
    params['start'] = offset
    response = requests.get(url, params=params)
    print('url:', response.url)
    #print('status:', response.status_code)

    soup = bs(response.text, "html.parser")

    all_products = soup.find_all('div', {'class': 'product-tile'})
    for product in all_products:
        itemid = product.get('data-itemid')
        print('itemid:', itemid)

        data = product.get('data-product')
        print('data:', data)

        name = product.find('span', {'itemprop': 'name'}).text
        print('name:', name)

        all_prices = product.find_all('div', {'class': 'price__text'})
        print('len(all_prices):', len(all_prices))

        price = all_prices[0].get('aria-label')
        print('price:', price)

        results.append((itemid, name, price, data))

print(results)

# ---
# ... here you can save all `results` in file ...
import pandas as pd

df = pd.DataFrame(results, columns=['itemid', 'name', 'price', 'data'])
df.to_excel('results.xlsx', index=False)  # write the results to an Excel file
So just now I was trying to get the URLs with this modified code:
import requests
from bs4 import BeautifulSoup as bs
from selenium import webdriver

# set the chromedriver.exe path
driver = webdriver.Chrome(executable_path="C:\\chromedriver.exe")
# implicit wait
driver.implicitly_wait(0.5)
# maximize the browser window
driver.maximize_window()
# launch the URL
driver.get('https://www.mercadopublico.cl/Home/BusquedaLicitacion')
# identify the element
l = driver.find_element_by_xpath("//button[text()='Check it Now']")
# perform the click
driver.execute_script("arguments[0].click();", l)

url = 'https://www.mercadopublico.cl/Home/BusquedaLicitacion'
response = requests.get(url)
print('url:', response.url)
#print('status:', response.status_code)
soup = bs(response.text, "html.parser")
all_products = soup.find_all('a', {'href': '#'})
for product in all_products:
    itemurl = product.get('onclick')
    print('itemurl:', itemurl)  # up to here

# close the browser
driver.quit()
But it didn't print anything, and I'm not sure what failed.
Thanks very much.
CodePudding user response:
The URL doesn't change because the page makes a POST request with the search query:
POST https://www.mercadopublico.cl/BuscarLicitacion/Home/Buscar
And the request data is:
{
    "textoBusqueda": "camas",
    "idEstado": "5",
    "codigoRegion": "-1",
    "idTipoLicitacion": "-1",
    "fechaInicio": null,
    "fechaFin": null,
    "registrosPorPagina": "10",
    "idTipoFecha": [],
    "idOrden": "1",
    "compradores": [],
    "garantias": null,
    "rubros": [],
    "proveedores": [],
    "montoEstimadoTipo": [0],
    "esPublicoMontoEstimado": null,
    "pagina": 0
}
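To handle the pagination barrier, you can resend the same POST and just increment the pagina field. A minimal sketch, assuming the endpoint and payload shown above (whether the omitted fields are required is untested):

# Sketch: page through the results by incrementing "pagina".
# Endpoint and field names are taken from the request shown above.
import requests

payload = {
    "textoBusqueda": "camas",
    "idEstado": "5",
    "codigoRegion": "-1",
    "idTipoLicitacion": "-1",
    "registrosPorPagina": "10",
    "idOrden": "1",
    "montoEstimadoTipo": [0],
    "pagina": 0,
}

with requests.Session() as s:
    s.get("https://www.mercadopublico.cl/Home")  # picks up the verification cookie
    for page in range(3):  # first three pages, as an example
        payload["pagina"] = page
        res = s.post("https://www.mercadopublico.cl/BuscarLicitacion/Home/Buscar",
                     data=payload)
        print(page, res.status_code, len(res.text))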
There is also a cookie that may be needed: __RequestVerificationToken_L0hvbWU1.
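A requests.Session picks that cookie up automatically on the first GET; a quick sketch to check that it is set:

# Quick check that the verification cookie was captured by the session.
import requests

with requests.Session() as s:
    s.get("https://www.mercadopublico.cl/Home")
    print(s.cookies.get_dict())  # __RequestVerificationToken_L0hvbWU1 should show up here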
Then you can get the link to the pop-up from the HTML; it's inside the onclick attribute of the link.
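For instance, here is a small sketch that pulls the URL out of the onclick with a regex; it assumes the attribute looks like $.Busqueda.verFicha('<url>') and uses the same .lic-block-body a selector as the full example further down:

# Sketch: extract pop-up URLs from onclick="$.Busqueda.verFicha('<url>')".
import re
from bs4 import BeautifulSoup

def extract_popup_links(html):
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.select(".lic-block-body a"):
        m = re.search(r"verFicha\('([^']+)'\)", a.get("onclick", ""))
        if m:
            links.append(m.group(1))
    return links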
If you need more help, just ask in the comment section.
Python Example: I've currently got it working up to the final step. When I looked at the CSV and JSON files, I realized they are both invalid; the site seems to attach some HTML at the bottom of both. I would recommend just scraping the data from the final page rather than downloading the CSV/JSON.
import requests
from bs4 import BeautifulSoup


def get_headers(session):
    # initial GET so the session picks up the required cookies
    res = session.get("https://www.mercadopublico.cl/Home")
    if res.status_code == 200:
        print("Got headers")
        # return res.text
    else:
        print("Failed to get headers")


def search(session):
    data = {
        "textoBusqueda": "Camas",
        "idEstado": "5",
        "codigoRegion": "-1",
        "idTipoLicitacion": "-1",
        "fechaInicio": None,
        "fechaFin": None,
        "registrosPorPagina": "10",
        "idTipoFecha": [],
        "idOrden": "1",
        "compradores": [],
        "garantias": None,
        "rubros": [],
        "proveedores": [],
        "montoEstimadoTipo": [0],
        "esPublicoMontoEstimado": None,
        "pagina": 0
    }
    res = session.post(
        "https://www.mercadopublico.cl/BuscarLicitacion/Home/Buscar",
        data=data)
    if res.status_code == 200:
        print("Search succeeded")
        return res.text
    else:
        print("Search failed with error:", res.reason)


def get_popup_link(html):
    soup = BeautifulSoup(html, "html.parser")
    dirty_links = [link["onclick"] for link in soup.select(".lic-block-body a")]
    # clean the onclick links
    clean_links = [link.replace("$.Busqueda.verFicha('", "").replace("')", "")
                   for link in dirty_links]
    return clean_links


def get_download_html(s, links):
    # fetch every pop-up page (the original returned after the first link)
    pages = []
    for link in links:
        res = s.get(link)
        if res.status_code == 200:
            print("fetch succeeded")
            pages.append(res.text)
        else:
            print("fetch failed with error:", res.reason)
    return pages


def get_download_links(html):
    # placeholder for the final step (currently identical to get_popup_link)
    soup = BeautifulSoup(html, "html.parser")
    dirty_links = [link["onclick"] for link in soup.select(".lic-block-body a")]
    clean_links = [link.replace("$.Busqueda.verFicha('", "").replace("')", "")
                   for link in dirty_links]
    return clean_links


def main():
    with requests.Session() as s:
        get_headers(s)
        html = search(s)
        popup_links = get_popup_link(html)
        print(popup_links)
        download_html = get_download_html(s, popup_links)
        # print(download_html)


main()
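As a follow-up to that recommendation, here is a small sketch (the file naming is my own assumption) that saves each fetched pop-up page to disk so you can inspect the HTML and choose which fields to scrape. You could call save_pages(download_html) at the end of main().

# Sketch: persist each pop-up page for inspection, instead of relying
# on the broken CSV/JSON downloads; file names are hypothetical.
def save_pages(pages):
    for i, html in enumerate(pages):
        with open(f"licitacion_{i}.html", "w", encoding="utf-8") as f:
            f.write(html)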