Using Selenium to Download a File in a Dynamic Page

I'm trying to download a file using Selenium with ChromeDriver in Python. The page where the file is hosted is the URL passed to driver.get() in the code below.

To access the file I have to click on the latest date (e.g. today is 22/06/2022) and then click on the link "Baixar Arquivo" ("Download File").

I'm trying to do it using Selenium with ChromeDriver. However, every method I have tried so far raised an exception. The problem is that the page has lots of nested elements and the divs share the same names. I have no idea how to solve this. What I've tried so far:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
import traceback

service = ChromeService(executable_path=ChromeDriverManager().install())
op = webdriver.ChromeOptions()
op.add_argument('headless')

try:
    driver = webdriver.Chrome(service=service, options=op)
    driver.get('https://www.b3.com.br/pt_br/market-data-e-indices/servicos-de-dados/market-data/consultas/boletim-diario/dados-publicos-de-produtos-listados-e-de-balcao/')
    container = driver.find_element(by=By.XPATH, value='/html/body/div/div/div/div/div[2]/div[1]/div/div[1]/a/div/div/div')  # didn't work
    container = driver.find_element(by=By.CSS_SELECTOR, value='html body div#wrapper div div div.container div.row div.col-lg-8.list div.accordion div.card div#collapse.collapse.show div.card-block.col-12 div.list-avatar.two-line div.list-avatar-row.tamanho div.content p.fonte.secondary-text a')  # didn't work either
    container = driver.find_element(by=By.LINK_TEXT, value='Baixar Arquivo')  # didn't work either
    print(container)
except Exception:
    print(traceback.format_exc())
finally:
    driver.close()

Every time I get the same error (no such element):

Traceback (most recent call last):
  File "/tmp/ipykernel_182073/1979137013.py", line 8, in <cell line: 5>
    container = driver.find_element(by=By.LINK_TEXT , value='Baixar Arquivo')
  File "/home/guilherme/miniconda3/envs/investment/lib/python3.9/site-packages/selenium/webdriver/remote/webdriver.py", line 1251, in find_element
    return self.execute(Command.FIND_ELEMENT, {
  File "/home/guilherme/miniconda3/envs/investment/lib/python3.9/site-packages/selenium/webdriver/remote/webdriver.py", line 430, in execute
    self.error_handler.check_response(response)
  File "/home/guilherme/miniconda3/envs/investment/lib/python3.9/site-packages/selenium/webdriver/remote/errorhandler.py", line 247, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"link text","selector":"Baixar Arquivo"}
  (Session info: headless chrome=103.0.5060.53)
Stacktrace:
#0 0x561b67a2db13 <unknown>
#1 0x561b67834688 <unknown>
#2 0x561b6786bcc7 <unknown>
#3 0x561b6786be91 <unknown>
#4 0x561b6789ee34 <unknown>
#5 0x561b678898dd <unknown>
#6 0x561b6789cb94 <unknown>
#7 0x561b678897a3 <unknown>
#8 0x561b6785f0ea <unknown>
#9 0x561b67860225 <unknown>
#10 0x561b67a752dd <unknown>
#11 0x561b67a792c7 <unknown>
#12 0x561b67a5f22e <unknown>
#13 0x561b67a7a0a8 <unknown>
#14 0x561b67a53bc0 <unknown>
#15 0x561b67a966c8 <unknown>
#16 0x561b67a96848 <unknown>
#17 0x561b67ab0c0d <unknown>
#18 0x7f9208c4c609 <unknown>

I'm running out of ideas. PS: I tried these strategies on other sites and they worked, so I'm assuming the problem is the nested elements. How can I solve this?

CodePudding user response：

If you look at the network requests the page makes, you can skip Selenium and use Python's requests library instead. The page makes three requests. The first returns a list of all the files (the first entry is the most recent). The second requests a token for a specific file. The third uses that token to download the file.

import requests
import json

# This URL returns a list of all files
page_url = "https://arquivos.b3.com.br/api/channels/34dcaaeb-0306-4f45-a83e-4f66a23b42fa/subchannels/cc188e40-03be-408e-aa86-501926b97a76/publications?&lang=pt"

# We get the list in json format
page_request = requests.get(page_url)
page_response = json.loads(page_request.content)

# You could loop through page_response to access all files,
# this is for the latest one
latest_file = page_response[0]

# We extract the file name and the date to use in the next request
file_name = latest_file["fileName"].split("File")[0]
date = latest_file["dateTime"].split("T")[0]

# We add the extracted info in this new url to get the token
token_url = f"https://arquivos.b3.com.br/api/download/requestname?fileName={file_name}&date={date}&recaptchaToken="

token_request = requests.get(token_url)
token_response = json.loads(token_request.content)

# We extract the token value from the response
token = token_response["redirectUrl"].split("?token=")[1]

# We call this URL to get the file we want
file_url = f"https://arquivos.b3.com.br/api/download/?token={token}"

file_request = requests.get(file_url)
file_response = file_request.content

# The response body is the raw CSV content, so we can write it straight to disk
with open(latest_file["fileName"], "wb") as csv_file:
    csv_file.write(file_response)

You can wrap this in try/except for error handling, or even restructure it into a class. Since the first request returns the whole list of files, you can also filter that list and choose a file by date or by any other field.
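As a rough sketch of choosing a file by date with some error handling, something like the following could work. The helper name pick_publication and the sample list are made up for illustration. The code also assumes that "dateTime" starts with an ISO date (YYYY-MM-DD) before the "T", which is what the split in the script above relies on.

```python
import datetime
from typing import Optional


def pick_publication(publications: list, wanted: Optional[datetime.date] = None) -> dict:
    """Return the entry published on `wanted`, or the first entry if no date is given."""
    if not publications:
        raise LookupError("the publication list is empty")
    if wanted is None:
        # According to the answer, the first entry is the most recent one.
        return publications[0]
    day = wanted.isoformat()  # e.g. "2022-06-22"
    for entry in publications:
        # Same date extraction as in the script above: keep the part before "T".
        if entry["dateTime"].split("T")[0] == day:
            return entry
    raise LookupError(f"no publication dated {day}")


# Made-up sample with the two fields the script reads; this is not real API output.
sample = [
    {"fileName": "ExampleFile.csv", "dateTime": "2022-06-22T00:00:00"},
    {"fileName": "ExampleFile.csv", "dateTime": "2022-06-21T00:00:00"},
]

try:
    chosen = pick_publication(sample, datetime.date(2022, 6, 21))
    print(chosen["dateTime"])
except LookupError as err:
    print(f"Nothing to download: {err}")
```

In the script above, latest_file = pick_publication(page_response, datetime.date(2022, 6, 22)) would then replace latest_file = page_response[0]. The sketch uses import datetime rather than from datetime import date so that it does not clash with the script's own date variable.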

Hope this helps :)