Attempt to scrape search results from a site - Python

Time:11-19

I need to scrape the telephone numbers and email addresses from the following page using Python:

import requests
from bs4 import BeautifulSoup

url = 'https://rma.cultura.gob.ar/#/app/museos/resultados?provincias=Buenos Aires'

source = requests.get(url).text

soup = BeautifulSoup(source, 'lxml')

print(soup)

The problem is that the response from requests.get does not contain the HTML I need. I suppose the site uses JavaScript to render those results, but I'm not familiar with that since I'm just starting out with Python. I worked around this by copying the source of each result page into its own text file and then extracting the emails with a regex, but I'm curious whether there is a simple way to access the data directly.
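
One likely contributing reason the plain request falls short: everything after the # in that URL is a fragment, which HTTP clients do not send to the server, so requests only ever asks for the site's root page. A small sketch using the standard library shows how the URL splits:

```python
from urllib.parse import urlsplit

url = 'https://rma.cultura.gob.ar/#/app/museos/resultados?provincias=Buenos Aires'

parts = urlsplit(url)

# The path the server actually receives is just the root.
print(parts.path)
# The route and the province filter live in the fragment, which stays client-side.
print(parts.fragment)
```

The route and filter are then interpreted by JavaScript running in the browser, which is why the raw response lacks the results.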

CodePudding user response:

The data you see on the page is loaded from an external URL via JavaScript. You can fetch it directly with the requests and json modules, for example:

import json
import requests

api_url = "https://rmabackend.cultura.gob.ar/api/museos"

params = {
    "estado": "Publicado",
    "grupo": "Museo",
    "o": "p",
    "ordenar": "nombre_oficial_institucion",
    "page": 1,
    "page_size": "12",
    "provincias": "Buenos Aires",
}

while True:
    data = requests.get(api_url, params=params).json()

    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))

    for d in data["data"]:
        print(d["attributes"]["nombre-oficial-institucion"])

    if params["page"] == data["meta"]["pagination"]["pages"]:
        break

    params["page"] += 1

Prints:

2 Museos, Bellas Artes y MAC
Archivo Histórico y Museo "Astillero Río Santiago" (ARS)
Archivo Histórico y Museo del Servicio Penitenciario Bonaerense
Archivo y Museo Historico Municipal Roberto T. Barili "Villa Mitre"
Asociación Casa Bruzzone
Biblioteca Popular y Museo "José Manuel Estrada"
Casa Museo "Haroldo Conti"
Casa Museo "Xul Solar" -  Tigre
Complejo Histórico y Museográfico "Dr. Alfredo Antonio Sabaté"


...and so on.
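
Since the original goal was the email addresses, here is a minimal sketch of one way to pull them out of each record without knowing the exact attribute names, which are not shown above. It serializes a record back to text and applies a regex. The helper name find_emails, the pattern, and the sample record are all hypothetical illustrations, not the API's actual fields:

```python
import json
import re

# Deliberately loose pattern for illustration; it is not a full address validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def find_emails(record):
    """Return the unique email-like strings found anywhere in a JSON-like record."""
    text = json.dumps(record, ensure_ascii=False)
    return sorted(set(EMAIL_RE.findall(text)))

# Hypothetical record shaped like the items in data["data"]; field names are made up.
sample = {
    "attributes": {
        "nombre-oficial-institucion": "Museo de Ejemplo",
        "contacto": ["info@example.org", "Tel: 000-0000"],
        "notas": "Escribir a info@example.org o visitas@example.org",
    }
}

print(find_emails(sample))
```

Inside the loop above, calling find_emails(d) for each d in data["data"] would collect the addresses per museum. Phone numbers are harder to match generically, so looking up the right key in the printed JSON is probably more reliable.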

CodePudding user response:

The page uses AJAX to load its content. Using something like Selenium to drive a real browser lets all the JavaScript run, after which you can extract the rendered source:

from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


driver = webdriver.Chrome()
url = 'https://rma.cultura.gob.ar/#/app/museos/resultados?provincias=Buenos Aires'

# navigate to the page
driver.get(url)
# wait until a link with text 'ficha' has loaded
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.PARTIAL_LINK_TEXT, 'ficha')))
source = driver.page_source
soup = BeautifulSoup(source, features='lxml')
driver.quit()
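
To get from the rendered source to the addresses, one option (a sketch, not necessarily how this site marks up its data) is to look for mailto: links in the HTML. The helper name emails_from_html and the sample markup are hypothetical, and only the standard library is used:

```python
import html
import re
from urllib.parse import unquote

# Captures the address in href="mailto:..." up to the closing quote or a ?subject=... suffix.
MAILTO_RE = re.compile(r"""href=["']mailto:([^"'?]+)""", re.IGNORECASE)

def emails_from_html(source):
    """Return unique addresses from mailto: links, in order of first appearance."""
    seen = []
    for raw in MAILTO_RE.findall(source):
        # Undo HTML entities and percent-encoding before comparing.
        address = unquote(html.unescape(raw)).strip()
        if address and address not in seen:
            seen.append(address)
    return seen

# Hypothetical markup for illustration; the real page structure may differ.
sample_source = '''
<a href="mailto:museo@example.org">Escribir</a>
<a href='mailto:info%40example.org?subject=Hola'>Contacto</a>
<a href="mailto:museo@example.org">Duplicado</a>
'''

print(emails_from_html(sample_source))
```

With the Selenium snippet above, the same helper could be called as emails_from_html(source) on the captured page source.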