I am trying to scrape the following site: https://resultados.registraduria.gov.co/consejo/541/colombia/amazonas/leticia . However, when I run the code below I don't get anything from the table on the page. BeautifulSoup does not seem to capture that information, which I suspect is related to how the page is built. Also, when I try: req = Request(url, headers=headers), I get a 403 Forbidden error. How can I get the number of votes from that table, as well as the valid and blank vote totals shown at the top of the page?
import requests
from bs4 import BeautifulSoup

# Browser-like User-Agent sent with the request
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0"
}
url = "https://resultados.registraduria.gov.co/consejo/541/colombia/amazonas/leticia"
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
# Select the table container by a partial match on its class attribute
for EachPart in soup.select('div[class*="TablaListas___StyledDiv-sc-1dgusch-3 kLOQyO"]'):
    print(EachPart.get_text())
CodePudding user response:
The page content is loaded dynamically by JavaScript. If you disable JavaScript in your browser, you will notice that the content disappears from the page. That is why requests/BeautifulSoup can't grab the data, so you need a browser automation tool such as Selenium. Here I use Selenium with BeautifulSoup.
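As a quick illustration of the problem, here is a minimal sketch of what a parser sees when the server only sends an empty page shell. The STATIC_SHELL string, the BodyTextCollector class, and the visible_body_text helper are made up for this example and are not the real response from the site; the sketch uses only the standard library so it runs without a browser or a network connection.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for what a plain HTTP client receives from a
# JavaScript-rendered page: an empty mount point plus a script tag.
STATIC_SHELL = """
<html>
  <head><title>Resultados</title></head>
  <body>
    <div id="root"></div>
    <script src="/static/js/main.js"></script>
  </body>
</html>
"""


class BodyTextCollector(HTMLParser):
    """Collect non-empty text found inside <body>, skipping <script> content."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_body and not self.in_script and data.strip():
            self.chunks.append(data.strip())


def visible_body_text(html):
    """Return the list of visible text chunks inside <body>."""
    parser = BodyTextCollector()
    parser.feed(html)
    return parser.chunks


# The shell contains no table text at all, so there is nothing to select.
print(visible_body_text(STATIC_SHELL))  # prints []
```

The table rows only exist after the browser runs the script, which is why the rendered page source has to come from a real browser session.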
Script
import time

from bs4 import BeautifulSoup
from selenium import webdriver

# Path to the ChromeDriver executable; recent Selenium releases may expect
# the path to be passed through a Service object instead
driver = webdriver.Chrome('chromedriver.exe')
driver.maximize_window()
time.sleep(8)
url = 'https://resultados.registraduria.gov.co/consejo/541/colombia/amazonas/leticia'
driver.get(url)
time.sleep(5)  # crude wait so JavaScript has time to render the table

# Parse the rendered page source, not the raw HTTP response
soup = BeautifulSoup(driver.page_source, 'lxml')
divs = soup.select('.TablaListas__Table-sc-1dgusch-10.iTTvmp div li')
for div in divs:
    name = div.select_one('p.FilaTablaListas___StyledP-sc-1a79vk2-2.lgiJOC').text
    votes = div.select_one('.FilaTablaListas__PorcentajeTexto-sc-1a79vk2-20.dYtRIv').text
    percentage = div.select_one('.FilaTablaListas__PorcentajeTexto-sc-1a79vk2-20.dYtRIv span').text
    # The site uses a decimal comma; switch to a decimal point
    votes_in_percentage = percentage.replace(',', '.')
    print('name:' + name, 'votes:' + votes, 'votes_in_percentage:' + votes_in_percentage, end='\n\n')
driver.quit()
Output
name:PARTIDO CENTRO DEMOCRATICO votes:321 votes_in_percentage:27.18%
name:PARTIDO LIBERAL COLOMBIANO votes:249 votes_in_percentage:21.08%
name:MOVIMIENTO ALTERNATIVO INDIGENA Y SOCIAL "MAIS" votes:233 votes_in_percentage:19.72%
name:PARTIDO SOCIAL DE UNIDAD NACIONAL "PARTIDO DE LA U" votes:157 votes_in_percentage:13.29%
name:PARTIDO CONSERVADOR COLOMBIANO votes:47 votes_in_percentage:3.97%
name:JAC BNUEVO votes:38 votes_in_percentage:3.21%
name:PARTIDO ALIANZA VERDE votes:37 votes_in_percentage:3.13%
name:JAC BARRIO CENTRO votes:29 votes_in_percentage:2.45%
name:SEMILLAS DE IDENTIDAD Y AUTONOMIA votes:26 votes_in_percentage:2.20%
name:JAC BARRIO TAUCHI votes:19 votes_in_percentage:1.60%
name:JAC VICTORIA REGIA votes:13 votes_in_percentage:1.10%
name:PARTIDO CAMBIO RADICAL votes:12 votes_in_percentage:1.01%
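The values printed above are still strings. If you want to sum or sort them, a small conversion step helps. This is a minimal sketch: parse_percentage and parse_votes are hypothetical helper names, and treating '.' in a vote count as a thousands separator is an assumption about how the site formats larger numbers, not something shown in the output above.

```python
def parse_percentage(text):
    """Convert a string such as '27,18%' or '27.18%' to the float 27.18."""
    cleaned = text.strip().rstrip("%").strip().replace(",", ".")
    return float(cleaned)


def parse_votes(text):
    """Convert a vote count such as '321' to an int.

    Assumption: a '.' inside a count (e.g. '1.234') is a thousands
    separator and can be dropped.
    """
    return int(text.strip().replace(".", ""))


print(parse_percentage("27,18%"))  # prints 27.18
print(parse_votes("321"))          # prints 321
```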
CodePudding user response:
Try using another name for the "headers" dictionary. You could also try rewriting the response variable to:
response = requests.get(url, headers={'User-Agent': 'Chrome'})
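Since the question also mentions Request(url, headers=headers), here is a minimal sketch of how to check that the User-Agent header is really attached to a urllib request before it is sent. build_request is a hypothetical helper name, and nothing is sent over the network here. Note that urllib stores header names in capitalized form, so the lookup key is 'User-agent'.

```python
from urllib.request import Request

URL = "https://resultados.registraduria.gov.co/consejo/541/colombia/amazonas/leticia"
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) "
    "Gecko/20100101 Firefox/94.0"
)


def build_request(url, user_agent):
    """Build (but do not send) a request carrying a browser-like User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


req = build_request(URL, USER_AGENT)
# urllib normalizes the header name to 'User-agent'
print(req.has_header("User-agent"))  # prints True
print(req.get_header("User-agent"))
```

Even if the header is accepted and the server stops answering with 403, the response may still lack the table, because the rows are rendered by JavaScript as described in the other answer.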