What is the fastest way to retrieve header names from excel files using pandas-CodePudding

I have a big size excel files that I'm organizing the column names into a unique list. The code below works, but it takes ~9 minutes! Does anyone have suggestions for speeding it up?

import pandas as pd
import os
get_col = list(pd.read_excel("E:\DATA\dbo.xlsx",nrows=1, engine='openpyxl').columns)
print(get_col)

CodePudding user response：

Using pandas to extract just the column names of a large excel file is very inefficient. You can use openpyxl for this:

from openpyxl import load_workbook

wb = load_workbook("E:\DATA\dbo.xlsx", read_only=True)

columns = {}

for sheet in worksheets:
    for value in sheet.iter_rows(min_row=1, max_row=1, values_only=True):
        columns = value

Assuming you only have one sheet, you will get a tuple of column names here.

CodePudding user response：

If you want faster reading, then I suggest you use other type files. Excel, while convenient and fast are binary files, therefore for pandas to be able to read it and correctly parse it must use the full file. Using nrows or skipfooter to work with less data with only happen after the full data is loaded and therefore shouldn't really affect the waiting time. On the opposite, when working with a .csv() file, given its type and that there is no significant metadata, you can just extract the first rows of it as an interable using the chunksize parameter in pd.read_csv().

Other than that, using list() with a dataframe as value, returns a list of the columns already. So my only suggestion for the code you use is:

get_col = list(pd.read_excel("E:\DATA\dbo.xlsx",nrows=1, engine='openpyxl'))

The stronger suggestion is to change datatype if you specifically want to address this issue.