Creating dataframe with multi level column index from from four 2d numpy arrays-CodePudding

I have four 2d numpy arrays:

import numpy as np
import pandas as pd    
x1 = np.array([[2, 4, 1],
       [2, 2, 1],
       [1, 3, 3],
       [2, 2, 1],
       [3, 3, 2]])
x2 = np.array([[1, 2, 2],
       [4, 1, 4],
       [1, 4, 4],
       [3, 3, 2],
       [2, 2, 4]])

x3 = np.array([[4, 3, 2],
       [4, 3, 2],
       [4, 3, 3],
       [1, 2, 2],
       [1, 4, 3]])      
x4 = np.array([[3, 1, 1],
       [3, 4, 3],
       [2, 2, 1],
       [2, 1, 1],
       [1, 2, 4]])

And I would like to create a dataframe as following:

level_1_label = ['location1','location2','location3']
level_2_label = ['x1','x2','x3','x4']
header = pd.MultiIndex.from_product([level_1_label, level_2_label], names=['Location','Variable'])
df = pd.DataFrame(np.concatenate((x1,x1,x3,x4),axis=1), columns=header)
df.index.name = 'Time'
df

Data in this DataFrame is not in the desired form.

I want the four columns (x1,x2,x3,x4) in the first level column label (location1) should be created by taking the first columns from all the numpy arrays. The next four columns (x1,x2,x3,x4) ie. the four columns in the second first level column label (location2) should be created by taking second columns from all four numpy arrays and so on. The length of first level column label ie. len(level_1_label) will be equal to the number of columns in all four 2d numpy arrays.

Desired DataFrame:

CodePudding user response：

One option is to reverse the order in creating the MultiIndex column (since level_1_label corresponds to the columns and level_2_label corresponds to the arrays); then swaplevel sort_index (to get it in the desired order) after building the DataFrame:

level_1_label = ['location1','location2','location3']
level_2_label = ['x1','x2','x3','x4']
header = pd.MultiIndex.from_product([level_2_label, level_1_label], names=['Variable','Location'])
df = pd.DataFrame(np.concatenate((x1,x2,x3,x4),axis=1), columns=header).swaplevel(axis=1).sort_index(level=0, axis=1)
df.index.name = 'Time'

Output:

Location    location1          location2          location3         
Variable    x1 x2 x3 x4        x1 x2 x3 x4        x1 x2 x3 x4
Time                                                          
0            2  1  4  3         4  2  3  1         1  2  2  1
1            2  4  4  3         2  1  3  4         1  4  2  3
2            1  1  4  2         3  4  3  2         3  4  3  1
3            2  3  1  2         2  3  2  1         1  2  2  1
4            3  2  1  1         3  2  4  2         2  4  3  4

CodePudding user response：

One option is to reshape the data in Fortran order, before creating the dataframe:


# reusing your code
level_1_label = ['location1','location2','location3']
level_2_label = ['x1','x2','x3','x4']
header = pd.MultiIndex.from_product([level_1_label, level_2_label], names=['Location','Variable'])

# np.vstack is just a convenience wrapper around np.concatenate, axis=1
outcome = np.reshape(np.vstack([x1,x2,x3,x4]), (len(x1), -1), order = 'F')
df = pd.DataFrame(outcome, columns = header)
df.index.name = 'Time'

df

Location location1          location2          location3
Variable        x1 x2 x3 x4        x1 x2 x3 x4        x1 x2 x3 x4
Time
0                2  1  4  3         4  2  3  1         1  2  2  1
1                2  4  4  3         2  1  3  4         1  4  2  3
2                1  1  4  2         3  4  3  2         3  4  3  1
3                2  3  1  2         2  3  2  1         1  2  2  1
4                3  2  1  1         3  2  4  2         2  4  3  4