I am trying to scrape the following site: https://resultados.registraduria.gov.co/consejo/541/colombia/amazonas/leticia . However, when I run the code below I don't get anything from the table on the page. BeautifulSoup does not seem to capture that information, which I suspect is related to how the page is built. Also, when I try req = Request(url, headers=headers), I get a Forbidden error. How can I get the number of votes from that table, as well as the valid and blank vote totals shown at the top of the page?

    import requests
    from bs4 import BeautifulSoup

    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0"
    }
    url = "https://resultados.registraduria.gov.co/consejo/541/colombia/amazonas/leticia"

    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Look for the results table by its generated class name
    for EachPart in soup.select('div[class*="TablaListas___StyledDiv-sc-1dgusch-3 kLOQyO"]'):
        print(EachPart.get_text())

CodePudding user response：

The page content is loaded dynamically by JavaScript. If you disable JavaScript in your browser, you will notice that the content disappears from the page. That is why requests/BeautifulSoup can't grab the data, so you need a browser automation tool such as Selenium. Here I use Selenium together with BeautifulSoup.
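If you want to confirm this without touching browser settings, a rough check is to see whether the text you are after appears in the HTML that the server sends before any JavaScript runs. The sketch below uses only the standard library; the helper name text_in_static_html and the two sample HTML strings are made up for illustration and are not the real page source.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes that are not inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that sits outside script/style blocks
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def text_in_static_html(html, marker):
    """Return True if `marker` appears in the visible text of `html`."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return marker.lower() in " ".join(parser.chunks).lower()


# Hypothetical stand-ins: an empty JavaScript "shell" page and a rendered page
js_shell = ('<html><body><div id="root"></div>'
            '<script>var t = "PARTIDO LIBERAL";</script></body></html>')
rendered = ('<html><body><ul><li><p>PARTIDO LIBERAL COLOMBIANO</p></li></ul>'
            '</body></html>')

print(text_in_static_html(js_shell, "PARTIDO LIBERAL"))  # False
print(text_in_static_html(rendered, "PARTIDO LIBERAL"))  # True
```

In practice you would pass response.text from requests as the first argument. A False result for a party name you can see in the browser suggests the table is filled in client-side.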

Script

from bs4 import BeautifulSoup
import time
from selenium import webdriver

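# Selenium 4.6+ locates a matching driver automatically; on older versions, pass the path to chromedriver.exe instead
driver = webdriver.Chrome()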
driver.maximize_window()
time.sleep(8)

url = 'https://resultados.registraduria.gov.co/consejo/541/colombia/amazonas/leticia'
driver.get(url)
time.sleep(5)


soup = BeautifulSoup(driver.page_source, 'lxml')
divs = soup.select('.TablaListas__Table-sc-1dgusch-10.iTTvmp div li')
for div in divs:
    name= div.select_one('p.FilaTablaListas___StyledP-sc-1a79vk2-2.lgiJOC').text
    votes = div.select_one('.FilaTablaListas__PorcentajeTexto-sc-1a79vk2-20.dYtRIv ').text
    percentage = div.select_one('.FilaTablaListas__PorcentajeTexto-sc-1a79vk2-20.dYtRIv   span').text
    votes_in_percentage=percentage.replace(',','.')
    print('name:' + name, 'votes:' + votes, 'votes_in_percentage:' + votes_in_percentage, end='\n\n')

Output

name:PARTIDO CENTRO DEMOCRATICO votes:321 votes_in_percentage:27.18%

name:PARTIDO LIBERAL COLOMBIANO votes:249 votes_in_percentage:21.08%

name:MOVIMIENTO ALTERNATIVO INDIGENA Y SOCIAL "MAIS" votes:233 votes_in_percentage:19.72%    

name:PARTIDO SOCIAL DE UNIDAD NACIONAL "PARTIDO DE LA U" votes:157 votes_in_percentage:13.29%

name:PARTIDO CONSERVADOR COLOMBIANO votes:47 votes_in_percentage:3.97%

name:JAC BNUEVO votes:38 votes_in_percentage:3.21%

name:PARTIDO ALIANZA VERDE votes:37 votes_in_percentage:3.13%

name:JAC BARRIO CENTRO votes:29 votes_in_percentage:2.45%

name:SEMILLAS DE IDENTIDAD Y AUTONOMIA votes:26 votes_in_percentage:2.20%

name:JAC BARRIO TAUCHI votes:19 votes_in_percentage:1.60%

name:JAC VICTORIA REGIA votes:13 votes_in_percentage:1.10%

name:PARTIDO CAMBIO RADICAL votes:12 votes_in_percentage:1.01%
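
The values printed above are still strings, and the percentage keeps its % sign. If you need numbers, for example to sum votes or sort by share, a small conversion step could look like the sketch below. The helper names parse_percentage and parse_votes are my own, and treating a dot in the vote count as a thousands separator is an assumption about how the site formats large numbers.

```python
def parse_percentage(text):
    """Convert a string such as '27,18%' or '27.18%' to a float."""
    cleaned = text.strip().rstrip('%').strip().replace(',', '.')
    return float(cleaned)


def parse_votes(text):
    """Convert a vote count such as '321' to an int.

    Assumption: a dot, if present, is a thousands separator ('1.234').
    """
    return int(text.strip().replace('.', ''))


# Sample values taken from the first row of the output
print(parse_votes('321'))          # 321
print(parse_percentage('27,18%'))  # 27.18
```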

CodePudding user response：

Try using another name for the "headers" dictionary. You could also try passing a different User-Agent directly in the request:

response = requests.get(url, headers={'User-Agent': 'Chrome'})
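
For the urllib Request mentioned in the question, the same idea would be to attach browser-like headers when building the request. This is only a sketch: the extra Accept and Accept-Language values and the fetch helper are my own assumptions, and nothing here guarantees the server stops answering with Forbidden. Even with a successful response, the table would still be missing from the HTML, because it is filled in by JavaScript.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

url = "https://resultados.registraduria.gov.co/consejo/541/colombia/amazonas/leticia"

# Assumed header set; only the User-Agent comes from the question
browser_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "es-CO,es;q=0.9,en;q=0.8",
}

req = Request(url, headers=browser_headers)


def fetch(request, timeout=10):
    """Send the request; return (status_code, body_bytes).

    HTTP errors such as 403 are returned as a status code instead of raising.
    """
    try:
        with urlopen(request, timeout=timeout) as resp:
            return resp.status, resp.read()
    except HTTPError as err:
        return err.code, b""


# urllib stores header names capitalized as 'User-agent'
print(req.get_header("User-agent"))
```

Calling fetch(req) sends the request and lets you inspect the status code without a traceback.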