DataFrame with column MultiIndex, advanced to

I have a pandas DataFrame whose structure follows this mock-up:

import numpy as np
import pandas as pd 
import pprint as pp

np.random.seed(0)

times = np.linspace(0, 3.0, num=5)
positions = np.linspace(0, 0.1, num=8)
fields = ["g", "h"]

columns = pd.MultiIndex.from_product([times, fields], names=["time", "field"])
index = pd.Index(positions, name="position")


data = np.random.randn(len(positions), len(times)*len(fields))

df = pd.DataFrame(data, columns=columns, index=index)

print(df)

Which would look like:

time          0.00                0.75                1.50                2.25                3.00          
field            g         h         g         h         g         h         g         h         g         h
position                                                                                                    
0.000000  1.764052  0.400157  0.978738  2.240893  1.867558 -0.977278  0.950088 -0.151357 -0.103219  0.410599
0.014286  0.144044  1.454274  0.761038  0.121675  0.443863  0.333674  1.494079 -0.205158  0.313068 -0.854096
0.028571 -2.552990  0.653619  0.864436 -0.742165  2.269755 -1.454366  0.045759 -0.187184  1.532779  1.469359
0.042857  0.154947  0.378163 -0.887786 -1.980796 -0.347912  0.156349  1.230291  1.202380 -0.387327 -0.302303
0.057143 -1.048553 -1.420018 -1.706270  1.950775 -0.509652 -0.438074 -1.252795  0.777490 -1.613898 -0.212740
0.071429 -0.895467  0.386902 -0.510805 -1.180632 -0.028182  0.428332  0.066517  0.302472 -0.634322 -0.362741
0.085714 -0.672460 -0.359553 -0.813146 -1.726283  0.177426 -0.401781 -1.630198  0.462782 -0.907298  0.051945
0.100000  0.729091  0.128983  1.139401 -1.234826  0.402342 -0.684810 -0.870797 -0.578850 -0.311553  0.056165

The idea is that the columns have a MultiIndex: the first level holds the "times", and for each "time" there are multiple "fields".

In the real case, the numbers of "positions", "times" and "fields" are much larger.

My goal is to convert this data frame to a dictionary, grouping every "time" of a given "field" as an array.

To be clearer, I would like to generate something like this:

{'g': array([[ 1.76405235,  0.97873798,  1.86755799,  0.95008842, -0.10321885],
       [ 0.14404357,  0.76103773,  0.44386323,  1.49407907,  0.3130677 ],
       [-2.55298982,  0.8644362 ,  2.26975462,  0.04575852,  1.53277921],
       [ 0.15494743, -0.88778575, -0.34791215,  1.23029068, -0.38732682],
       [-1.04855297, -1.70627019, -0.50965218, -1.25279536, -1.61389785],
       [-0.89546656, -0.51080514, -0.02818223,  0.06651722, -0.63432209],
       [-0.67246045, -0.81314628,  0.17742614, -1.63019835, -0.90729836],
       [ 0.72909056,  1.13940068,  0.40234164, -0.87079715, -0.31155253]]),
 'h': array([[ 0.40015721,  2.2408932 , -0.97727788, -0.15135721,  0.4105985 ],
       [ 1.45427351,  0.12167502,  0.33367433, -0.20515826, -0.85409574],
       [ 0.6536186 , -0.74216502, -1.45436567, -0.18718385,  1.46935877],
       [ 0.37816252, -1.98079647,  0.15634897,  1.20237985, -0.30230275],
       [-1.42001794,  1.9507754 , -0.4380743 ,  0.77749036, -0.21274028],
       [ 0.3869025 , -1.18063218,  0.42833187,  0.3024719 , -0.36274117],
       [-0.35955316, -1.7262826 , -0.40178094,  0.46278226,  0.0519454 ],
       [ 0.12898291, -1.23482582, -0.68481009, -0.57884966,  0.05616534]]),
 'position': array([0.        , 0.01428571, 0.02857143, 0.04285714, 0.05714286,
       0.07142857, 0.08571429, 0.1       ]),
 'time': array([0.  , 0.75, 1.5 , 2.25, 3.  ])}

For this mock-up specifically, it can be built manually with:

output = {
    'position': positions,
    'time': times,
    fields[0]: data[:, ::len(fields)],   # columns 0, 2, 4, ... hold field "g"
    fields[1]: data[:, 1::len(fields)],  # columns 1, 3, 5, ... hold field "h"
}

pp.pprint(output)
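
The same slicing can be generalized to any number of fields. This is only a minimal sketch working on the raw array rather than on the DataFrame, and it assumes the columns are ordered time-major as produced by from_product above (all fields of the first time, then all fields of the second time, and so on):

```python
import numpy as np

np.random.seed(0)

times = np.linspace(0, 3.0, num=5)
positions = np.linspace(0, 0.1, num=8)
fields = ["g", "h"]
data = np.random.randn(len(positions), len(times) * len(fields))

# Columns are ordered (t0, g), (t0, h), (t1, g), (t1, h), ...
# so field number i occupies every len(fields)-th column, starting at offset i.
output = {"position": positions, "time": times}
output.update({f: data[:, i::len(fields)] for i, f in enumerate(fields)})
```

This does not rely on the column labels at all, so it would silently give wrong results if the columns were ever reordered.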

I was thinking of something along the lines of df.to_dict('list'), similar to what is described here: https://stackoverflow.com/a/39074579/10812478

Answer:

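You can use groupby with a dictionary comprehension to build the field arrays, then add the other keys afterwards:

# group the transposed frame by field (groupby(axis=1) is deprecated since pandas 2.1),
# then transpose each group back so rows are positions and columns are times
out = {k: g.T.to_numpy() for k, g in df.T.groupby(level='field')}
out['position'] = df.index.to_numpy()
out['time'] = df.columns.unique(level='time').to_numpy()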

NB. I used np.random.seed(0) to generate the input

output:

{'g': array([[ 1.76405235,  0.97873798,  1.86755799,  0.95008842, -0.10321885],
       [ 0.14404357,  0.76103773,  0.44386323,  1.49407907,  0.3130677 ],
       [-2.55298982,  0.8644362 ,  2.26975462,  0.04575852,  1.53277921],
       [ 0.15494743, -0.88778575, -0.34791215,  1.23029068, -0.38732682],
       [-1.04855297, -1.70627019, -0.50965218, -1.25279536, -1.61389785],
       [-0.89546656, -0.51080514, -0.02818223,  0.06651722, -0.63432209],
       [-0.67246045, -0.81314628,  0.17742614, -1.63019835, -0.90729836],
       [ 0.72909056,  1.13940068,  0.40234164, -0.87079715, -0.31155253]]),
 'h': array([[ 0.40015721,  2.2408932 , -0.97727788, -0.15135721,  0.4105985 ],
       [ 1.45427351,  0.12167502,  0.33367433, -0.20515826, -0.85409574],
       [ 0.6536186 , -0.74216502, -1.45436567, -0.18718385,  1.46935877],
       [ 0.37816252, -1.98079647,  0.15634897,  1.20237985, -0.30230275],
       [-1.42001794,  1.9507754 , -0.4380743 ,  0.77749036, -0.21274028],
       [ 0.3869025 , -1.18063218,  0.42833187,  0.3024719 , -0.36274117],
       [-0.35955316, -1.7262826 , -0.40178094,  0.46278226,  0.0519454 ],
       [ 0.12898291, -1.23482582, -0.68481009, -0.57884966,  0.05616534]]),
 'position': array([0.        , 0.01428571, 0.02857143, 0.04285714, 0.05714286,
       0.07142857, 0.08571429, 0.1       ]),
 'time': array([0.  , 0.75, 1.5 , 2.25, 3.  ])}
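
The per-field arrays can also be built without groupby. The following is a minimal sketch, assuming the df from the question (rebuilt here so the block runs on its own); the name result is just illustrative. It selects each field with DataFrame.xs and reads the distinct time values directly from the column MultiIndex:

```python
import numpy as np
import pandas as pd

np.random.seed(0)

times = np.linspace(0, 3.0, num=5)
positions = np.linspace(0, 0.1, num=8)
fields = ["g", "h"]

columns = pd.MultiIndex.from_product([times, fields], names=["time", "field"])
index = pd.Index(positions, name="position")
data = np.random.randn(len(positions), len(columns))
df = pd.DataFrame(data, columns=columns, index=index)

# xs keeps the columns whose "field" level equals f and drops that level,
# leaving one column per time; to_numpy() then gives a (positions x times) array.
result = {
    f: df.xs(f, level="field", axis=1).to_numpy()
    for f in df.columns.unique(level="field")
}
result["position"] = df.index.to_numpy()
result["time"] = df.columns.unique(level="time").to_numpy()
```

Because the selection is done by label, this should not depend on how the columns happen to be ordered in memory, only on the level names "time" and "field".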