Multi-index Dataframe from dictionary of Dataframes

I'd like to create a multi-index DataFrame from a dictionary of DataFrames, where the top-level index is the index of the DataFrames within the dictionary and the second-level index is the keys of the dictionary.

Example

import pandas as pd
dt_index = pd.to_datetime(['2003-05-01', '2003-05-02', '2003-05-03'])
column_names = ['Y', 'X']
df_dict = {'A': pd.DataFrame([[1, 3], [7, 4], [5, 8]], index=dt_index, columns=column_names),
           'B': pd.DataFrame([[12, 3], [9, 8], [75, 0]], index=dt_index, columns=column_names),
           'C': pd.DataFrame([[3, 12], [5, 1], [22, 5]], index=dt_index, columns=column_names)}

Expected output:

               Y   X
2003-05-01 A   1   3
2003-05-01 B  12   3
2003-05-01 C   3  12
2003-05-02 A   7   4
2003-05-02 B   9   8
2003-05-02 C   5   1
2003-05-03 A   5   8
2003-05-03 B  75   0
2003-05-03 C  22   5

I've tried

pd.concat(df_dict, axis=0)

but this gives me the levels of the multi-index in the incorrect order.
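For reference, when `pd.concat` is given a dict it uses the dictionary keys as the outermost index level, which is why the levels come out reversed here:

```python
import pandas as pd

dt_index = pd.to_datetime(['2003-05-01', '2003-05-02', '2003-05-03'])
column_names = ['Y', 'X']
df_dict = {'A': pd.DataFrame([[1, 3], [7, 4], [5, 8]], index=dt_index, columns=column_names),
           'B': pd.DataFrame([[12, 3], [9, 8], [75, 0]], index=dt_index, columns=column_names),
           'C': pd.DataFrame([[3, 12], [5, 1], [22, 5]], index=dt_index, columns=column_names)}

out = pd.concat(df_dict, axis=0)
# The dictionary keys become the *outer* level; the dates end up inside.
print(out.index[0])  # ('A', Timestamp('2003-05-01 00:00:00'))
```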

Edit: Timings

Based on the answers so far, this seems like a slow operation to perform as the DataFrame scales.

Larger dummy data:

import numpy as np
import pandas as pd
D = 3000
C = 500
dt_index = pd.date_range('2000-1-1', periods=D)
keys = 'abcdefghijk'
df_dict = {k:pd.DataFrame(np.random.rand(D,C), index=dt_index) for k in keys}

To convert the dictionary to a DataFrame, albeit with the index levels swapped, takes:

%timeit pd.concat(df_dict, axis=0)
  63.4 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Even in the best case, creating a DataFrame with the index levels in the other order takes 8 times longer than the above!

%timeit pd.concat(df_dict, axis=0).swaplevel().sort_index()
  528 ms ± 25.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pd.concat(df_dict, axis=1).stack(0)
  1.72 s ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

CodePudding user response:

Use DataFrame.swaplevel with DataFrame.sort_index:

df = pd.concat(df_dict, axis=0).swaplevel(0, 1).sort_index()
print(df)
               Y   X
2003-05-01 A   1   3
           B  12   3
           C   3  12
2003-05-02 A   7   4
           B   9   8
           C   5   1
2003-05-03 A   5   8
           B  75   0
           C  22   5
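With the dates outermost, a single day can then be selected directly with `.loc` (a small usage sketch built on the question's `df_dict`):

```python
import pandas as pd

dt_index = pd.to_datetime(['2003-05-01', '2003-05-02', '2003-05-03'])
column_names = ['Y', 'X']
df_dict = {'A': pd.DataFrame([[1, 3], [7, 4], [5, 8]], index=dt_index, columns=column_names),
           'B': pd.DataFrame([[12, 3], [9, 8], [75, 0]], index=dt_index, columns=column_names),
           'C': pd.DataFrame([[3, 12], [5, 1], [22, 5]], index=dt_index, columns=column_names)}

df = pd.concat(df_dict, axis=0).swaplevel(0, 1).sort_index()

# Selecting one date drops the outer level, leaving the dictionary keys.
day = df.loc[pd.Timestamp('2003-05-01')]
print(list(day.index))  # ['A', 'B', 'C']
```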

CodePudding user response:

You can reach down into numpy for a speed up if you can guarantee two things:

  1. Each of your DataFrames in df_dict have the exact same index
  2. Each of your DataFrames is already sorted.

import numpy as np
import pandas as pd
D = 3000
C = 500
dt_index = pd.date_range('2000-1-1', periods=D)
keys = 'abcdefghijk'
df_dict = {k:pd.DataFrame(np.random.rand(D,C), index=dt_index) for k in keys}

out = pd.DataFrame(
    data=np.column_stack([*df_dict.values()]).reshape(-1, C),
    index=pd.MultiIndex.from_product([df_dict["a"].index, df_dict.keys()]),
)

# check if this result is consistent with other answers
assert (pd.concat(df_dict, axis=0).swaplevel().sort_index() == out).all().all()
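Both guarantees can be verified cheaply before taking the numpy fast path (a minimal sketch; `reference` is just a local name, and the dummy data mirrors the question's):

```python
import numpy as np
import pandas as pd

D, C = 3000, 500
dt_index = pd.date_range('2000-1-1', periods=D)
df_dict = {k: pd.DataFrame(np.random.rand(D, C), index=dt_index) for k in 'abc'}

# Guarantee 1: every frame shares the exact same index ...
reference = next(iter(df_dict.values())).index
assert all(df.index.equals(reference) for df in df_dict.values())
# Guarantee 2: ... and that index is already sorted.
assert reference.is_monotonic_increasing
```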

Timing:

%%timeit
pd.concat(df_dict, axis=0)
# 26.2 ms ± 412 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
pd.DataFrame(
    data=np.column_stack([*df_dict.values()]).reshape(-1, 500),
    index=pd.MultiIndex.from_product([df_dict["a"].index, df_dict.keys()]),
)
# 31.2 ms ± 497 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
pd.concat(df_dict, axis=0).swaplevel().sort_index()
# 123 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

CodePudding user response:

Use concat on axis=1 and stack:

out = pd.concat(df_dict, axis=1).stack(0)

Output:

               X   Y
2003-05-01 A   3   1
           B   3  12
           C  12   3
2003-05-02 A   4   7
           B   8   9
           C   1   5
2003-05-03 A   8   5
           B   0  75
           C   5  22
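Note the columns appear as X, Y rather than the original Y, X: in older pandas, `stack` sorts the stacked result. If the original order matters, reindexing the columns restores it either way (a small sketch using the question's data):

```python
import pandas as pd

dt_index = pd.to_datetime(['2003-05-01', '2003-05-02', '2003-05-03'])
column_names = ['Y', 'X']
df_dict = {'A': pd.DataFrame([[1, 3], [7, 4], [5, 8]], index=dt_index, columns=column_names),
           'B': pd.DataFrame([[12, 3], [9, 8], [75, 0]], index=dt_index, columns=column_names),
           'C': pd.DataFrame([[3, 12], [5, 1], [22, 5]], index=dt_index, columns=column_names)}

out = pd.concat(df_dict, axis=1).stack(0)

# Restore the original column order from the question.
out = out[column_names]
```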