Importing multiple excel files and combining into dataframe-CodePudding

I am trying to import many excel files (around 400) into one dataframe from a folder but I seem to be running into an error.

The files I want from my folder are names filename followed by a date - "filename_yyyy_mm_dd.xlsx".

I want to keep the header as the files have all same columns for different dates.

My current code is:

import glob
import pandas as pd
import os


path = r"C:\Users\..."

my_files = glob.glob(os.path.join(path, "filename*.xlsx"))

file_li = []

for filename in my_files:
    df = pd.read_excel(filename, index_col=None, header=1)
    file_li.append(df)
    
frame = pd.concat(file_li, axis=0, ignore_index=True)

When I call my frame I dont get any response? Am I doing something wrong in the way I am calling the file name?

Update:

My excel files look like this:

Column 1	Column 2	Column 3	Column 4	Column 5	Column 6	Column 7	Column 8	Column 9	Column 10	Column 11	Column 12	Column 13	Column 14
Date	SREC-MD	SREC	Feb-25	MDX	F		85	0	0	8086	02/25/2025	20107

with around 300-400 rows.

My output has captured the 14 columns but it has added a lot more as doing frame.info() shows I have 922 columns.

Update 2:

CodePudding user response：

It's hard to tell why you're getting the extra columns but you can try this :

import glob
import pandas as pd
import os


path = r"C:\Users\..."

my_files = glob.glob(os.path.join(path, "filename*.xlsx"))

file_li = []

for filename in my_files:
    df = pd.read_excel(filename, index_col=None, header=None)
    file_li.append(df)
    
frame = (
            pd.concat(file_li, axis=0, ignore_index=True)
                .dropna(how="all") #to get rid of the eventual extra rows abobe each header
                .drop_duplicates() #to get rid of the cumulated duplicated headers
                .T.set_index(0).T #to make the first row as header of the dataframe
        )

I suggest you however to check if there is any missing rows in frame compared to your spreadsheets.

CodePudding user response：

Instead of using concat, you could try reading the files into a df and then append them to one combined csv using mode='a'. Then read the combined csv.

for filename in my_files:
       df = pd.read_excel(filename, index_col=None, header=1)
       df.to_csv('combined.csv', mode='a', header=False)


df = pd.read_csv('combined.csv')