I'm working with a numpy array called array_test with shape (5, 359, 2). This is checked with array_test.shape
. The array reflects mean and uncertainty for observations in 5 repetitions of an experiment.
The goal of this is to be able to estimate the mean value of each observation across the 5 repetitions of the experiment, and to estimate the total uncertainty per observation also a mean across the 5 repetitions.
I would need to create a pandas dataframe from it, I believe with a multiindex in which the first level would have 5 values from the first dimension (named simply '1', '2', etc.), and a second one which would be 'mean' and 'uncertainty'.
Suggestions are more than welcome!
CodePudding user response:
IIUC, you might want to aggregate in numpy, then construct a DataFrame and stack:
a = np.random.random((5, 359, 2))
out = pd.DataFrame(a.mean(1), index=range(1, a.shape[0] 1),
columns=['mean', 'uncertainty']).stack()
Output (a Series):
1 mean 0.499102
uncertainty 0.511757
2 mean 0.480295
uncertainty 0.473132
3 mean 0.500507
uncertainty 0.519352
4 mean 0.505443
uncertainty 0.493672
5 mean 0.514302
uncertainty 0.519299
dtype: float64
For a DataFrame:
out = pd.DataFrame(a.mean(1), index=range(1, a.shape[0] 1),
columns=['mean', 'uncertainty']).stack().to_frame('value')
Output:
value
1 mean 0.499102
uncertainty 0.511757
2 mean 0.480295
uncertainty 0.473132
3 mean 0.500507
uncertainty 0.519352
4 mean 0.505443
uncertainty 0.493672
5 mean 0.514302
uncertainty 0.519299
CodePudding user response:
I would approach it by using a normal Dataframe, but adding columns for the observation and experiment number.
import numpy as np
import pandas as pd
a = np.random.rand(5, 10, 2)
# Get the shape
n_experiments, n_observations, n_values = a.shape
# Reshape array into a 2-dimensional array
# (stacking experiments on top of each other)
a = a.reshape(-1, n_values)
# Create Dataframe and add experiment and observation number
df = pd.DataFrame(a, columns=["mean", "uncertainty"])
# This returns an array, like [0, 0, 0, 0, 0, 1, 1, 1, ..., 4, 4]
experiment = np.repeat(range(n_experiments), n_observations)
df["experiment"] = experiment
# This returns an array like [0, 1, 2, 3, 4, 0, 1, 2, ..., 3, 4]
observation = np.tile(range(n_observations), n_experiments)
df["observation"] = observation
The Dataframe now looks like this:
print(df.head(15))
mean uncertainty experiment observation
0 0.741436 0.775086 0 0
1 0.401934 0.277716 0 1
2 0.148269 0.406040 0 2
3 0.852485 0.702986 0 3
4 0.240930 0.644746 0 4
5 0.309648 0.914761 0 5
6 0.479186 0.495845 0 6
7 0.154647 0.422658 0 7
8 0.381012 0.756473 0 8
9 0.939797 0.764821 0 9
10 0.994342 0.019140 1 0
11 0.300225 0.992146 1 1
12 0.265698 0.823469 1 2
13 0.791907 0.555051 1 3
14 0.503281 0.249237 1 4
Now you can analyze the Dataframe (with groupby and mean):
# Only the mean
print(df[['observation', 'mean', 'uncertainty']].groupby(['observation']).mean())
mean uncertainty
observation
0 0.699324 0.506369
1 0.382288 0.456324
2 0.333396 0.324469
3 0.690545 0.564583
4 0.365198 0.555231
5 0.453545 0.596149
6 0.526988 0.395162
7 0.565689 0.569904
8 0.425595 0.415944
9 0.731776 0.375612
Or with more advanced aggregate functions, which are probably useful for your usecase:
# Use aggregate function to calculate not only mean, but min and max as well
print(df[['observation', 'mean', 'uncertainty']].groupby(['observation']).aggregate(['mean', 'min', 'max']))
mean uncertainty
mean min max mean min max
observation
0 0.699324 0.297030 0.994342 0.506369 0.019140 0.974842
1 0.382288 0.063046 0.810411 0.456324 0.108774 0.992146
2 0.333396 0.148269 0.698921 0.324469 0.009539 0.823469
3 0.690545 0.175471 0.895190 0.564583 0.260557 0.721265
4 0.365198 0.015501 0.726352 0.555231 0.249237 0.929258
5 0.453545 0.111355 0.807582 0.596149 0.101421 0.914761
6 0.526988 0.323945 0.786167 0.395162 0.007105 0.691998
7 0.565689 0.154647 0.813336 0.569904 0.302157 0.964782
8 0.425595 0.116968 0.567544 0.415944 0.014439 0.756473
9 0.731776 0.411324 0.939797 0.375612 0.085988 0.764821