I'm new to pandas in Python and I'm trying to combine a lot of Excel files from a folder (each file contains two sheets) and then add only certain columns from those sheets to a new DataFrame. Each file has the same number of columns and the same sheet names, but sometimes a different number of rows.
I'll show you what I did using an example with two files. Screenshots of the sheets:
Sheets from the second file have the same structure, but with different data in it.
Code:
import pandas as pd
import os

folder = [file for file in os.listdir('./test_folder/')]
consolidated = pd.DataFrame()

for file in folder:
    # sheet_name given as a list makes read_excel return a dict of DataFrames
    first = pd.concat(pd.read_excel('./test_folder/' + file, sheet_name=['first']))
    second = pd.concat(pd.read_excel('./test_folder/' + file, sheet_name=['second']))
    first_new = first.drop(['Col_K', 'Col_L', 'Col_M'], axis=1)  # dropping unnecessary columns
    second_new = second.drop(['Col_DD', 'Col_EE', 'Col_FF', 'Col_GG', 'Col_HH', 'Col_II', 'Col_JJ', 'Col_KK', 'Col_LL', 'Col_MM', 'Col_NN', 'Col_OO', 'Col_PP', 'Col_QQ', 'Col_RR', 'Col_SS', 'Col_TT'], axis=1)  # dropping unnecessary columns
    frames = [consolidated, second_new, first_new]
    consolidated = pd.concat(frames, axis=0)

consolidated.to_excel('all.xlsx', index=True)
So here is a result
And here's my desired result
So basically, I don't know how to get rid of these empty cells and align the two data frames with each other. Most likely the problem is with the indexes of the DataFrames (first_new, second_new), but I don't know how to resolve it.
CodePudding user response:
pd.concat() has an ignore_index parameter. With axis=1 the frames are placed side by side and their rows are matched on the index; ignore_index=True then discards the column names and numbers the columns 0, 1, 2, ... instead. If the frames have a common index (like in my example), you do not need ignore_index and can keep the column names.
Try:
pd.concat(frames, axis=1, ignore_index=True)
In [5]: df1 = pd.DataFrame({"A":2, "B":3}, index=[0, 1])
In [6]: df1
Out[6]:
   A  B
0  2  3
1  2  3
In [7]: df2 = pd.DataFrame({"AAA":22, "BBB":33}, index=[0, 1])
In [10]: df = pd.concat([df1, df2], axis=1, ignore_index=True)
In [11]: df
Out[11]:
   0  1   2   3
0  2  3  22  33
1  2  3  22  33
In [12]: df = pd.concat([df1, df2], axis=1, ignore_index=False)
In [13]: df
Out[13]:
   A  B  AAA  BBB
0  2  3   22   33
1  2  3   22   33
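If the two sheets do not share an index (which seems likely in the question, since calling pd.concat on the dict returned by read_excel puts the sheet name into the index), one option is to reset each frame's index before placing them side by side. Below is a minimal sketch of that idea, not a drop-in solution: the helper side_by_side and the small in-memory frames are made up for illustration and stand in for the two sheets read from each file.

```python
import pandas as pd


def side_by_side(first, second):
    """Place two frames next to each other, matching rows by position."""
    # Drop each frame's own index so that row 0 lines up with row 0, and so on.
    parts = [first.reset_index(drop=True), second.reset_index(drop=True)]
    return pd.concat(parts, axis=1)


# Hypothetical stand-ins for the "first" and "second" sheet of two files.
files = [
    (
        pd.DataFrame({"Col_A": [1, 2, 3]}, index=[10, 11, 12]),
        pd.DataFrame({"Col_AA": ["x", "y"]}),
    ),
    (
        pd.DataFrame({"Col_A": [4]}),
        pd.DataFrame({"Col_AA": ["z"]}),
    ),
]

# Build one side-by-side block per file, then stack the blocks vertically.
consolidated = pd.concat(
    [side_by_side(first, second) for first, second in files],
    axis=0,
    ignore_index=True,
)
print(consolidated)
```

When the two sheets of one file have a different number of rows, the shorter one is padded with NaN in that file's block. In the real loop, the frames would come from pd.read_excel instead of being built by hand.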