Pandas dataframe from numpy array with multiindex-CodePudding

I'm working with a numpy array called array_test with shape (5, 359, 2). This is checked with array_test.shape. The array reflects mean and uncertainty for observations in 5 repetitions of an experiment.

The goal of this is to be able to estimate the mean value of each observation across the 5 repetitions of the experiment, and to estimate the total uncertainty per observation also a mean across the 5 repetitions.

I would need to create a pandas dataframe from it, I believe with a multiindex in which the first level would have 5 values from the first dimension (named simply '1', '2', etc.), and a second one which would be 'mean' and 'uncertainty'.

Suggestions are more than welcome!

CodePudding user response：

IIUC, you might want to aggregate in numpy, then construct a DataFrame and stack:

a = np.random.random((5, 359, 2))

out = pd.DataFrame(a.mean(1), index=range(1, a.shape[0] 1),
                   columns=['mean', 'uncertainty']).stack()

Output (a Series):

1  mean           0.499102
   uncertainty    0.511757
2  mean           0.480295
   uncertainty    0.473132
3  mean           0.500507
   uncertainty    0.519352
4  mean           0.505443
   uncertainty    0.493672
5  mean           0.514302
   uncertainty    0.519299
dtype: float64

For a DataFrame:

out = pd.DataFrame(a.mean(1), index=range(1, a.shape[0] 1),
                   columns=['mean', 'uncertainty']).stack().to_frame('value')

Output:

                  value
1 mean         0.499102
  uncertainty  0.511757
2 mean         0.480295
  uncertainty  0.473132
3 mean         0.500507
  uncertainty  0.519352
4 mean         0.505443
  uncertainty  0.493672
5 mean         0.514302
  uncertainty  0.519299

CodePudding user response：

I would approach it by using a normal Dataframe, but adding columns for the observation and experiment number.

import numpy as np
import pandas as pd

a = np.random.rand(5, 10, 2)

# Get the shape
n_experiments, n_observations, n_values = a.shape

# Reshape array into a 2-dimensional array
# (stacking experiments on top of each other)
a = a.reshape(-1, n_values)

# Create Dataframe and add experiment and observation number
df = pd.DataFrame(a, columns=["mean", "uncertainty"])

# This returns an array, like [0, 0, 0, 0, 0, 1, 1, 1, ..., 4, 4]
experiment = np.repeat(range(n_experiments), n_observations)
df["experiment"] = experiment
# This returns an array like [0, 1, 2, 3, 4, 0, 1, 2, ..., 3, 4]
observation = np.tile(range(n_observations), n_experiments)
df["observation"] = observation

The Dataframe now looks like this:

print(df.head(15))

      mean  uncertainty  experiment  observation
0   0.741436     0.775086           0            0
1   0.401934     0.277716           0            1
2   0.148269     0.406040           0            2
3   0.852485     0.702986           0            3
4   0.240930     0.644746           0            4
5   0.309648     0.914761           0            5
6   0.479186     0.495845           0            6
7   0.154647     0.422658           0            7
8   0.381012     0.756473           0            8
9   0.939797     0.764821           0            9
10  0.994342     0.019140           1            0
11  0.300225     0.992146           1            1
12  0.265698     0.823469           1            2
13  0.791907     0.555051           1            3
14  0.503281     0.249237           1            4

Now you can analyze the Dataframe (with groupby and mean):

# Only the mean 
print(df[['observation', 'mean', 'uncertainty']].groupby(['observation']).mean())


                 mean  uncertainty
observation                       
0            0.699324     0.506369
1            0.382288     0.456324
2            0.333396     0.324469
3            0.690545     0.564583
4            0.365198     0.555231
5            0.453545     0.596149
6            0.526988     0.395162
7            0.565689     0.569904
8            0.425595     0.415944
9            0.731776     0.375612

Or with more advanced aggregate functions, which are probably useful for your usecase:

# Use aggregate function to calculate not only mean, but min and max as well
print(df[['observation', 'mean', 'uncertainty']].groupby(['observation']).aggregate(['mean', 'min', 'max']))



                 mean                     uncertainty                    
                 mean       min       max        mean       min       max
observation                                                              
0            0.699324  0.297030  0.994342    0.506369  0.019140  0.974842
1            0.382288  0.063046  0.810411    0.456324  0.108774  0.992146
2            0.333396  0.148269  0.698921    0.324469  0.009539  0.823469
3            0.690545  0.175471  0.895190    0.564583  0.260557  0.721265
4            0.365198  0.015501  0.726352    0.555231  0.249237  0.929258
5            0.453545  0.111355  0.807582    0.596149  0.101421  0.914761
6            0.526988  0.323945  0.786167    0.395162  0.007105  0.691998
7            0.565689  0.154647  0.813336    0.569904  0.302157  0.964782
8            0.425595  0.116968  0.567544    0.415944  0.014439  0.756473
9            0.731776  0.411324  0.939797    0.375612  0.085988  0.764821