Convert .xls file to latest version in Python-CodePudding

I did a web_scraping on a site that, through it, directs to a download of an .xls file, which replaces the old file in the destination folder, according to the Python script, below:

My script:

import time 
from selenium import webdriver
from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select, WebDriverWait
from selenium.webdriver.support.select import Select
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
import shutil
import os
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd
import matplotlib



driver = webdriver.Chrome(ChromeDriverManager().install())

driver = webdriver.Chrome()

driver.get('http://estatisticas.cetip.com.br/astec/series_v05/paginas/lum_web_v05_series_introducao.asp?str_Modulo=Ativo&int_Idioma=1&int_Titulo=6&int_NivelBD=2/')
driver.find_element_by_xpath('//*[@id="divContainerIframeBmf"]/div/dl/dd[2]/a').click()
time.sleep(3)
driver.switch_to.frame(driver.find_element(By.XPATH, '//iframe[@name="dados_corpo"]'))
driver.switch_to.frame(driver.find_element(By.XPATH, '//frame[@name="ativo"]'))
find_dp1 = driver.find_element(By.XPATH, '//select[@name="ativo"]')
select_find_dp1 = Select(find_dp1)
select_find_dp1.select_by_visible_text("CBIO - Crédito de descarbonização")
time.sleep(3)

driver.switch_to.default_content()
driver.switch_to.frame(driver.find_element(By.ID, 'dados_corpo'))
driver.switch_to.frame(driver.find_element(By.TAG_NAME, 'frameset').find_elements(By.TAG_NAME, 'frame')[1])

time.sleep(1)
informacoes = Select(driver.find_element(By.NAME, 'selectopcoes'))
informacoes.select_by_visible_text('Estoque')
    
driver.switch_to.default_content()
driver.switch_to.frame(driver.find_element(By.ID, 'dados_corpo'))
driver.switch_to.frame(driver.find_element(By.TAG_NAME, 'frameset').find_elements(By.TAG_NAME, 'frame')[2])

time.sleep(1)
# Data Inicial 
driver.find_element(By.NAME, 'DT_DIA_DE').send_keys('16')
driver.find_element(By.NAME, 'DT_MES_DE').send_keys('10')
driver.find_element(By.NAME, 'DT_ANO_DE').send_keys('2020')

# Data Final
driver.find_element(By.NAME, 'DT_DIA_ATE').send_keys('10')
driver.find_element(By.NAME, 'DT_MES_ATE').send_keys('02')
driver.find_element(By.NAME, 'DT_ANO_ATE').send_keys('2022')

driver.find_elements(By.CLASS_NAME, 'button')[1].click()

driver.switch_to.default_content()
driver.switch_to.frame(driver.find_element(By.TAG_NAME, 'iframe'))
time.sleep(1)
driver.find_element(By.CLASS_NAME, 'primary-text').find_element(By.TAG_NAME,'a').click()

time.sleep(4)

origem = 'C:\\Users\\prmatteo\\Downloads\\'
destino = os.path.join(origem, 'C:\\Users\\prmatteo\\xxx\\Área de Trabalho\\Arquivos Python\\renovabio2.xls')
extensao = '.xls'

for file in os.listdir(origem):
    if file.endswith(extensao):
        shutil.move(os.path.join(origem,file), destino)

It always downloads .xls file in old excel format. I would like to convert it to the latest excel version when I go to replace it in the destination so that it doesn't open in compatibility mode format on pc.

CodePudding user response：

I tried to download a file from that site and unfortunately for you, it doesn't produce Excel files at all. A lot of sites fake Excel exports by generating a CSV file or an HTML file with a table and a fake xls extension. Excel recognizes this and tries to import the file as if it was text or HTML and shows you a warning.

Unfortunately, in this case the file isn't even a CSV. It's the result page saved as text, including the headers. It doesn't even use UTF8 so non-US characters get mangled. The table data is just some rows with tabs as delimiters :

B3
ADA - Alongamento da Dívida Agrícola - Estoque

De: 20/03/2022 -->  Até: 18/04/2022

Valores Financeiros em Reais.
Estoque Valorizado.
Metologia de cálculo: Preço Unitário da Curva x Quantidade Depositada na data.

Data    Volume
21/03/2022  0
22/03/2022  0
23/03/2022  0
24/03/2022  0
25/03/2022  0
28/03/2022  0
29/03/2022  0
30/03/2022  0a
31/03/2022  0
01/04/2022  0
04/04/2022  0
05/04/2022  0
06/04/2022  0
07/04/2022  0
08/04/2022  0
11/04/2022  0
12/04/2022  0
13/04/2022  0
14/04/2022  0
18/04/2022  0

You'll have to parse this text file and create an Excel file yourself. One way you could do this is to use Pandas to read the file as a CSV with read_csv skipping the first 9 rows, and then save it as Excel with to_excel:

import pandas as pd
filename="fake_excel.xls"
df=pd.read_csv(filename, sep='\t',skiprows=9)
# Display it, to see what we got
df

df.to_excel("real.xlsx")

The read_csv method allows you to specify different delimiters, skip header and footer rows, change the encoding etc.

One possible problem is the date format. Unless you specify otherwise, read_csv will import dates as text. You can tell it to parse cells that appear to be dates and even try to infer the date format with the proper parameters.

You can inspect the loaded data in several ways. If you type df you'll see the first and last few rows of the DataFrame. You can use df.info() to get the number of columns and their types etc.

CodePudding user response：

If you have xlrd and pandas installed, read the data into a DataFrame using pandas.read_excel. Then output the file to xlsx using pandas.to_excel.