Home > database >  How to convert multiple excel files to CSV utf-8 encoding using python
How to convert multiple excel files to CSV utf-8 encoding using python

Time:05-27

I have 30 xlsx files in same directory and using python I would like to convert all files to csv with utf-8 encoding, regardless of whatever encoding is present in the file. I am using python's magic library to get the file names (below code).For conversion, I tried the code mention by SO user Julian here (I used the code posted here), but the code is throwing an error saying "InvalidFileException: openpyxl does not support file format, please check you can open it with Excel first. Supported formats are: .xlsx,.xlsm,.xltx,.xltm. Below is the code that is throwing an error.The second issue is based on my limited python knowledge I believe code will work for one excel file. How should I make it work for multiple files ?

Thanks in advance for your help!

# import a library to detect encodings
import magic
import glob

print("File".ljust(45), "Encoding")
for filename in glob.glob('path*.xlsx'):
    with open(filename, 'rb') as rawdata:
        result = magic.from_buffer(rawdata.read(2048))
    print(filename.ljust(45), result)

Code throwing error from SO User github link mentioned here

    from openpyxl import load_workbook
    import csv
    from os import sys
    
    def get_all_sheets(excel_file):
        sheets = []
        workbook = load_workbook(excel_file,read_only=True,data_only=True)
        all_worksheets = workbook.get_sheet_names()
        for worksheet_name in all_worksheets:
            sheets.append(worksheet_name)
        return sheets
    
    def csv_from_excel(excel_file, sheets):
        workbook = load_workbook(excel_file,data_only=True)
        for worksheet_name in sheets:
            print("Export "   worksheet_name   " ...")
    
            try:
                worksheet = workbook.get_sheet_by_name(worksheet_name)
            except KeyError:
                print("Could not find "   worksheet_name)
                sys.exit(1)
    
            your_csv_file = open(''.join([worksheet_name,'.csv']), 'wb')
            wr = csv.writer(your_csv_file, quoting=csv.QUOTE_ALL)
            for row in worksheet.iter_rows():
                lrow = []
                for cell in row:
                    lrow.append(cell.value)
                wr.writerow(lrow)
            print(" ... done")
            your_csv_file.close()
    
    if not 2 <= len(sys.argv) <= 3:
        print("Call with "   sys.argv[0]   " <xlxs file> [comma separated list of sheets to export]")
        sys.exit(1)
    else:
        sheets = []
        if len(sys.argv) == 3:
            sheets = list(sys.argv[2].split(','))
        else:
            sheets = get_all_sheets(sys.argv[1])
        assert(sheets != None and len(sheets

) > 0)
    csv_from_excel(sys.argv[1], sheets)

CodePudding user response:

Have you tried to use Pandas library? You can store all the files in a list using os. You can then loop through the list and open each Excel file using read_excel and then write to a csv. So it will look something like this:

import pandas as pd
import os

directory = somepath #make sure this only has the .xlsx files that need converting

fileList = [f for f in os.listdir(directory)]

for file in fileList:
    fileNameCSV = str(file)[:-5]#to reuse the file name and remove the .xlsx
    df = pd.read_excel(directory '/' file, sheet_name = someSheetName)
    df.to_csv("{}.csv".format(fileNameCSV), encoding="utf-8")

Not the best but should meet your needs

CodePudding user response:

In first, the first error is obvious: InvalidFileException: openpyxl does not support file format, please check you can open it with Excel first.

Does Excel successfully open this file? If yes, we need the workbook (or small part of it).

The answer to the second question:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# vi:ts=4:et

"""I test to open multiple files."""

from pathlib import Path

from openpyxl import load_workbook

# find all *.xlsx files into current directory
# and iterate over it
for file in Path('.').glob('*.xlsx'):
    # read the Excel file
    wb = load_workbook(file)
    # small test (optional)
    print(wb.active.title)
    # do something for export to CSV
  • Related