Web scraping not working with BeautifulSoup

Time:12-07

I am trying to scrape the following site: https://resultados.registraduria.gov.co/consejo/541/colombia/amazonas/leticia . However, when I run the code below I don't get anything from the table on the web page. BeautifulSoup does not seem to capture that information, which I suspect is related to how the page is built. Also, when I try req = Request(url, headers=headers), I get a Forbidden error. How can I get the number of votes from that table, as well as the valid and blank vote counts shown at the top of the page?

    import requests
    from bs4 import BeautifulSoup

    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0"
    }
    url = "https://resultados.registraduria.gov.co/consejo/541/colombia/amazonas/leticia"

    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Expected to print the table text, but nothing is printed.
    for EachPart in soup.select('div[class*="TablaListas___StyledDiv-sc-1dgusch-3 kLOQyO"]'):
        print(EachPart.get_text())

CodePudding user response:

The page content is loaded dynamically by JavaScript. If you disable JavaScript in your browser, you will notice that the content disappears from the page. That is why requests and BeautifulSoup can't grab the data, so you need a browser automation tool such as Selenium. Here I use Selenium together with BeautifulSoup.
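
One quick way to confirm this diagnosis without opening a browser is to look for the target class name in the raw HTML that the server returns. The sketch below uses only the standard library; the helper name has_class_fragment and the two sample markup strings are illustrative assumptions, not taken from the real page.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Records whether any tag has a class attribute containing a fragment."""

    def __init__(self, fragment):
        super().__init__()
        self.fragment = fragment
        self.found = False

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value and self.fragment in value:
                self.found = True


def has_class_fragment(html, fragment):
    """Return True if the HTML contains a tag whose class includes fragment."""
    finder = ClassFinder(fragment)
    finder.feed(html)
    return finder.found


# Hypothetical sample markup: a server response that only ships an empty
# mount point, and the same page after JavaScript has filled it in.
static_html = (
    '<html><body><div id="root"></div>'
    '<script src="app.js"></script></body></html>'
)
rendered_html = (
    '<html><body><div id="root">'
    '<div class="TablaListas__Table-sc-1dgusch-10 iTTvmp"></div>'
    '</div></body></html>'
)

print(has_class_fragment(static_html, "TablaListas"))    # False
print(has_class_fragment(rendered_html, "TablaListas"))  # True
```

Passing requests.get(url, headers=headers).text as the first argument would run the same check against the real page. If it returns False while the browser shows the table, the content is being added by JavaScript after the initial response.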

Script

from bs4 import BeautifulSoup
import time
from selenium import webdriver

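# Assumes chromedriver.exe is in the working directory. Newer Selenium 4
# releases no longer accept the path as a positional argument; there,
# webdriver.Chrome() with no arguments is the usual call.
driver = webdriver.Chrome('chromedriver.exe')
driver.maximize_window()
time.sleep(8)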

url = 'https://resultados.registraduria.gov.co/consejo/541/colombia/amazonas/leticia'
driver.get(url)
time.sleep(5)


soup = BeautifulSoup(driver.page_source, 'lxml')
divs = soup.select('.TablaListas__Table-sc-1dgusch-10.iTTvmp div li')
for div in divs:
    name = div.select_one('p.FilaTablaListas___StyledP-sc-1a79vk2-2.lgiJOC').text
    votes = div.select_one('.FilaTablaListas__PorcentajeTexto-sc-1a79vk2-20.dYtRIv').text
    percentage = div.select_one('.FilaTablaListas__PorcentajeTexto-sc-1a79vk2-20.dYtRIv span').text
    # The site uses a comma as the decimal separator; switch it to a dot.
    votes_in_percentage = percentage.replace(',', '.')
    print('name:' + name, 'votes:' + votes, 'votes_in_percentage:' + votes_in_percentage, end='\n\n')
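
The fixed time.sleep(5) in the script either wastes time or is too short on a slow connection. A common alternative is to poll until the content appears. The helper below is a minimal standard-library sketch of that idea; wait_until and the stand-in predicate table_is_present are hypothetical names introduced for illustration.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.25):
    """Call predicate repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once timeout seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(interval)


# Stand-in for a page that finishes rendering on the third check.
# With Selenium, the predicate would inspect driver.page_source instead.
checks = {"count": 0}


def table_is_present():
    checks["count"] += 1
    return checks["count"] >= 3


wait_until(table_is_present, timeout=5.0, interval=0.01)
print(checks["count"])  # 3
```

In the script, a call such as wait_until(lambda: 'TablaListas' in driver.page_source) could take the place of time.sleep(5). Selenium also ships its own WebDriverWait class in selenium.webdriver.support.ui for the same purpose.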

Output

name:PARTIDO CENTRO DEMOCRATICO votes:321 votes_in_percentage:27.18%

name:PARTIDO LIBERAL COLOMBIANO votes:249 votes_in_percentage:21.08%

name:MOVIMIENTO ALTERNATIVO INDIGENA Y SOCIAL "MAIS" votes:233 votes_in_percentage:19.72%

name:PARTIDO SOCIAL DE UNIDAD NACIONAL "PARTIDO DE LA U" votes:157 votes_in_percentage:13.29%

name:PARTIDO CONSERVADOR COLOMBIANO votes:47 votes_in_percentage:3.97%

name:JAC BNUEVO votes:38 votes_in_percentage:3.21%

name:PARTIDO ALIANZA VERDE votes:37 votes_in_percentage:3.13%

name:JAC BARRIO CENTRO votes:29 votes_in_percentage:2.45%

name:SEMILLAS DE IDENTIDAD Y AUTONOMIA votes:26 votes_in_percentage:2.20%

name:JAC BARRIO TAUCHI votes:19 votes_in_percentage:1.60%

name:JAC VICTORIA REGIA votes:13 votes_in_percentage:1.10%

name:PARTIDO CAMBIO RADICAL votes:12 votes_in_percentage:1.01%
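
The values printed above are still strings. If they are going to be summed or sorted, they can be converted to numbers first. This is a small sketch under two assumptions: the raw percentage uses a comma as the decimal separator (which is why the script replaces it), and a dot inside a vote count, if one ever appears, is a thousands separator. The function names are illustrative.

```python
def parse_percentage(text):
    """Convert '27,18%' (comma decimal) or '27.18%' to the float 27.18."""
    cleaned = text.strip().rstrip("%").strip().replace(",", ".")
    return float(cleaned)


def parse_votes(text):
    """Convert '321' to the int 321.

    Assumption: a '.' in a vote count is a thousands separator ('1.234').
    """
    return int(text.strip().replace(".", ""))


# First row of the output, with the percentage in its assumed raw form.
row = {"name": "PARTIDO CENTRO DEMOCRATICO", "votes": "321", "percentage": "27,18%"}
parsed = {
    "name": row["name"],
    "votes": parse_votes(row["votes"]),
    "percentage": parse_percentage(row["percentage"]),
}
print(parsed)
# {'name': 'PARTIDO CENTRO DEMOCRATICO', 'votes': 321, 'percentage': 27.18}
```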

CodePudding user response:

Try using a different name for the "headers" dictionary. You could also try changing the request to:

response = requests.get(url, headers={'User-Agent': 'Chrome'})