Merge Excel files with multiple sheets into one dataframe

I'm new to pandas and Python, and I'm trying to combine a lot of Excel files from a folder (each file contains two sheets) and then add only certain columns from those sheets to a new dataframe. Every file has the same number of columns and the same sheet names, but sometimes a different number of rows.

I'll show what I did using an example with two files. Screenshots of the sheets:

First sheet

Second sheet

Sheets from the second file have the same structure, but with different data in it.

Code:

import pandas as pd
import os

folder = [file for file in os.listdir('./test_folder/')]

consolidated = pd.DataFrame()

for file in folder:
    first = pd.concat(pd.read_excel('./test_folder/' + file, sheet_name=['first']))
    second = pd.concat(pd.read_excel('./test_folder/' + file, sheet_name=['second']))
    first_new = first.drop(['Col_K', 'Col_L', 'Col_M'], axis=1) #dropping unnecessary columns
    second_new = second.drop(['Col_DD', 'Col_EE', 'Col_FF','Col_GG','Col_HH', 'Col_II', 'Col_JJ', 'Col_KK', 'Col_LL', 'Col_MM', 'Col_NN', 'Col_OO', 'Col_PP', 'Col_QQ', 'Col_RR', 'Col_SS', 'Col_TT'], axis=1) #dropping unnecessary columns
    frames = [consolidated, second_new, first_new]
    consolidated = pd.concat(frames, axis=0)

consolidated.to_excel('all.xlsx', index=True)

Here is the result I get

And here's my desired result

So basically, I don't know how to get rid of these empty cells and align the two dataframes with each other. Most likely the problem is with the indexes of the dataframes (first_new, second_new), but I don't know how to resolve it.

CodePudding user response:

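With axis=1, pd.concat() places the frames side by side and matches their rows by index label. Its ignore_index parameter discards the labels along the concatenation axis, so with axis=1 the column names are replaced by 0, 1, 2, and so on. If the frames have a common row index (like in my example), you do not need ignore_index and can keep the column names. If the row indices differ across the individual frames, reset them first (for example with reset_index(drop=True)), because ignore_index does not change how rows are matched here.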
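To illustrate why the row index matters, here is a minimal sketch with made-up frames. The names `left` and `right` and their values are assumptions for illustration only, not data from the question. Frames whose row labels do not overlap end up padded with NaN under axis=1, while resetting the index first lines the rows up by position.

```python
import pandas as pd

# Hypothetical frames whose row labels do not overlap.
left = pd.DataFrame({"A": [1, 2]}, index=[0, 1])
right = pd.DataFrame({"B": [10, 20]}, index=[5, 6])

# axis=1 matches rows by index label: no labels are shared,
# so the result has four rows and every row is padded with NaN.
misaligned = pd.concat([left, right], axis=1)

# Resetting both indices to 0..n-1 matches the rows by position instead.
aligned = pd.concat(
    [left.reset_index(drop=True), right.reset_index(drop=True)],
    axis=1,
)

print(misaligned)
print(aligned)
```

This looks like the same kind of staggered, half-empty layout described in the question, which is why resetting the index before a side-by-side concat is worth trying.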

Try:

pd.concat(frames, axis=1, ignore_index=True)

In [5]: df1 = pd.DataFrame({"A":2, "B":3}, index=[0, 1])

In [6]: df1
Out[6]:
   A  B
0  2  3
1  2  3

In [7]: df2 = pd.DataFrame({"AAA":22, "BBB":33}, index=[0, 1])

In [10]: df = pd.concat([df1, df2], axis=1, ignore_index=True)

In [11]: df
Out[11]:
   0  1   2   3
0  2  3  22  33
1  2  3  22  33

In [12]: df = pd.concat([df1, df2], axis=1, ignore_index=False)

In [13]: df
Out[13]:
   A  B  AAA  BBB
0  2  3   22   33
1  2  3   22   33
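Applied to the folder of Excel files from the question, the same idea might look like the sketch below. It rests on a few assumptions: the helper names `combine_sheets` and `consolidate_folder` are hypothetical, the files are taken to end in `.xlsx`, the first sheet's columns are placed to the left of the second sheet's, and the demo columns `Col_A` and `Col_AA` are invented. Note that calling pd.concat() on the dict that read_excel returns for a list of sheet names adds the sheet name as an outer index level, so the two frames in the question never share row labels; passing a single sheet name returns a plain DataFrame instead.

```python
import os

import pandas as pd


def combine_sheets(first, second, drop_first=(), drop_second=()):
    """Place the two sheets of one file side by side, matched by row position."""
    first = first.drop(columns=list(drop_first)).reset_index(drop=True)
    second = second.drop(columns=list(drop_second)).reset_index(drop=True)
    # Assumed column order: first sheet on the left, second on the right.
    return pd.concat([first, second], axis=1)


def consolidate_folder(folder, drop_first=(), drop_second=()):
    """Combine sheets 'first' and 'second' of every .xlsx file in a folder."""
    parts = []
    for name in sorted(os.listdir(folder)):
        if not name.endswith(".xlsx"):
            continue  # skip anything that is not an Excel workbook
        path = os.path.join(folder, name)
        # A single sheet name (not a list) returns a DataFrame, not a dict.
        first = pd.read_excel(path, sheet_name="first")
        second = pd.read_excel(path, sheet_name="second")
        parts.append(combine_sheets(first, second, drop_first, drop_second))
    # Stack the per-file blocks vertically with a fresh 0..n-1 row index.
    # pd.concat raises ValueError if no matching files were found.
    return pd.concat(parts, axis=0, ignore_index=True)


# In-memory demo with invented data, so no Excel files are needed.
first = pd.DataFrame({"Col_A": [1, 2, 3], "Col_K": ["x", "y", "z"]})
second = pd.DataFrame({"Col_AA": [10, 20], "Col_DD": ["p", "q"]})
block = combine_sheets(first, second, drop_first=["Col_K"], drop_second=["Col_DD"])
print(block)
```

For the real data you would call consolidate_folder('./test_folder/', ...) with the two column lists from the question and write the result with to_excel('all.xlsx', index=False). If the two sheets of a file have different row counts, as in the demo, the shorter one is still padded with NaN at the bottom.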