Pandas: Import xlsx with several sheets, add a column to each dataframe with the name of the sheet it belongs to

Given the following test file:

https://docs.google.com/spreadsheets/d/1rRUZirjPj2cBeaukUG8ngEowv80Nqg6N/edit?usp=sharing&ouid=100016243141159098340&rtpof=true&sd=true

I need to import an .xlsx file that has 4 sheets (this is only an example; my original file has many more sheets), add a column to each dataframe with the name of the sheet it came from, and then concatenate the dataframes that have the same number of columns.

In this example, two sheets have two columns each (I want those in one dataframe), and the other two sheets have one column each (I want those in a second dataframe).

What have I done so far?

import pandas as pd

# sheet_name=None reads every sheet and returns a dict of {sheet name: dataframe}
my_dict = pd.read_excel('test.xlsx', header=0, sheet_name=None)

for key, df in my_dict.items():
    df['sheet_name'] = key  # add a column holding the name of the sheet

I don't know how to concatenate the dataframes inside the dictionary, grouped by the number of columns each one has. The result here would be two different dataframes.

CodePudding user response:

Read in the data:

xlsx = pd.read_excel('test.xlsx', sheet_name = None)

Create two variables: one holding the dataframes that have two columns, the other holding the dataframes that have only one column:

two = {key:value for key,value in xlsx.items() if value.columns.size == 2}
one = {key:value for key,value in xlsx.items() if value.columns.size == 1}

Concatenate two and one individually:

two = pd.concat(two, names = ['sheet_name', None]).droplevel(-1).reset_index()

two

  sheet_name    A    B    C    D
0        JFK  1.0  2.0  NaN  NaN
1        JFK  5.0  6.0  NaN  NaN
2        MIA  NaN  NaN  1.0  1.0
3        MIA  NaN  NaN  2.0  2.0



one = pd.concat(one, names = ['sheet_name', None]).droplevel(-1).reset_index()

one

  sheet_name  z
0        SJU  1
1        SJU  2
2        BCN  3
3        BCN  4
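To see what the concat / droplevel / reset_index chain is doing, it may help to run it one step at a time. The following is only a sketch: the `sheets` dictionary is a hypothetical stand-in for what `pd.read_excel(..., sheet_name=None)` returns, built in memory so the example runs without the test file, and the intermediate names `stacked` and `by_sheet` are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the dict returned by
# pd.read_excel('test.xlsx', sheet_name=None), limited to the one-column sheets.
sheets = {
    "SJU": pd.DataFrame({"z": [1, 2]}),
    "BCN": pd.DataFrame({"z": [3, 4]}),
}

# Step 1: passing a dict to pd.concat uses its keys as an outer index level.
# `names` labels that level 'sheet_name' and leaves the inner level unnamed.
stacked = pd.concat(sheets, names=["sheet_name", None])

# Step 2: drop the inner level (the original row numbers of each sheet).
by_sheet = stacked.droplevel(-1)

# Step 3: move 'sheet_name' out of the index into an ordinary column.
result = by_sheet.reset_index()

print(result)
```

Printing `stacked` and `by_sheet` as well shows how the index changes at each step.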

If you want the sheets with two columns to share the same column names, you can rename the columns while filtering the dictionary:

two = {key:value.set_axis(['A', 'B'], axis = 'columns') 
       for key,value in xlsx.items() 
       if value.columns.size == 2}

# concatenation will result in only three columns:

two = pd.concat(two, names = ['sheet_name', None]).droplevel(-1).reset_index()

two

  sheet_name  A  B
0        JFK  1  2
1        JFK  5  6
2        MIA  1  1
3        MIA  2  2
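Since the real file has many more sheets, hard-coding one variable per column count may not scale. One possible generalization, offered as a sketch rather than the only way to do it, is to bucket the sheets by column count first and then concatenate each bucket. The `xlsx` dictionary below is a hypothetical in-memory stand-in for `pd.read_excel('test.xlsx', sheet_name=None)`, with values taken from the outputs above, and `buckets` / `combined` are illustrative names.

```python
from collections import defaultdict

import pandas as pd

# Hypothetical stand-in for pd.read_excel('test.xlsx', sheet_name=None).
xlsx = {
    "JFK": pd.DataFrame({"A": [1, 5], "B": [2, 6]}),
    "MIA": pd.DataFrame({"C": [1, 2], "D": [1, 2]}),
    "SJU": pd.DataFrame({"z": [1, 2]}),
    "BCN": pd.DataFrame({"z": [3, 4]}),
}

# Bucket the sheets by how many columns each one has:
# {number of columns: {sheet name: dataframe}}
buckets = defaultdict(dict)
for name, df in xlsx.items():
    buckets[df.shape[1]][name] = df

# Concatenate each bucket, turning the dict keys into a 'sheet_name' column.
combined = {
    n_cols: pd.concat(group, names=["sheet_name", None])
    .droplevel(-1)
    .reset_index()
    for n_cols, group in buckets.items()
}

print(combined[2])  # sheets that had two columns
print(combined[1])  # sheets that had one column
```

`combined` is keyed by the original column count, so `combined[2]` and `combined[1]` correspond to `two` and `one` above. Note that if the `sheet_name` column has already been added in a loop beforehand, every count is one higher.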