How to dynamically loop over a NumPy N-d array's layers and save into a pandas DataFrame

I have a NumPy N-d array with shape (2, 3, 2). I want to calculate the mean and SD over each set/layer of the array and save them in a pandas DataFrame. I can do this using the following code:

import pandas as pd
import numpy as np

# Note: np.ndarray allocates uninitialized memory, so the values here are arbitrary placeholders
data_array = np.ndarray(shape=(2, 3, 2))
final_result = pd.DataFrame(
    {
        "Mean": np.array(data_array).mean(),
        "Mean_sd": np.array(data_array).mean(axis=0).std(ddof=1),
        "Mean_1": np.array(data_array[0]).mean(),
        "Mean_1_sd": np.array(data_array[0]).mean(axis=0).std(ddof=1),
        "Mean_2": np.array(data_array[1]).mean(),
        "Mean_2_sd": np.array(data_array[1]).mean(axis=0).std(ddof=1),
    },
    index=[0],
)

In this example I have only 2 layers/sets, so I selected each layer/set index (i.e., [0] or [1]) manually to calculate the mean and SD.

"Mean_1": np.array(data_array[0]).mean(),
"Mean_2": np.array(data_array[1]).mean(),

But the real data_array is big (say, with shape (100, 3, 2)), so writing out the layer/set indices by hand is not practical (and not Pythonic).

Is there a way to compute these statistics for every layer/set dynamically, without typing each index manually, and save them in the pandas DataFrame?

User response:

You can use the axis argument to take the means and SDs over the appropriate axes of your array, so you only need to write each calculation once. Then join the results into one big DataFrame (this could all be done inside concat, but it is split out here for clarity).

import numpy as np
import pandas as pd
data_array = np.arange(200).reshape(4, 5, 10)

# Overall mean and std across all values
df1 = pd.DataFrame({"Mean": np.array(data_array).mean(),
                    "Mean_sd": np.array(data_array).mean(axis=0).std(ddof=1)}, index=[0])

# Mean collapsing the last two axes
df2 = pd.DataFrame([data_array.mean(axis=(-2, -1))],
                    columns=[f'Mean_{i + 1}' for i in range(data_array.shape[0])])

# Per layer: mean over the middle axis, then SD (ddof=1) across the last axis
df3 = pd.DataFrame([data_array.mean(axis=-2).std(ddof=1, axis=-1)],
                    columns=[f'Mean_{i + 1}_sd' for i in range(data_array.shape[0])])

res = pd.concat([df1, df2, df3], axis=1)

print(res)
   Mean   Mean_sd  Mean_1  Mean_2  Mean_3  Mean_4  Mean_1_sd  Mean_2_sd  Mean_3_sd  Mean_4_sd
0  99.5  14.57738    24.5    74.5   124.5   174.5    3.02765    3.02765    3.02765    3.02765

As a check, this is what your final_result output would be for my above input. It has fewer columns because you would manually need to create the others, given my input was larger.

   Mean   Mean_sd  Mean_1  Mean_1_sd  Mean_2  Mean_2_sd
0  99.5  14.57738    24.5    3.02765    74.5    3.02765
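
The same check can be done numerically for every layer at once. This is only a sketch: it assumes a 3-D array whose first axis indexes the layers, and the variable names (layer_means, manual_means, and so on) are mine, not from the question. It compares the axis-based results with the per-layer formulas from the question applied in a loop.

```python
import numpy as np

data_array = np.arange(200).reshape(4, 5, 10)

# Vectorized: one value per layer, computed with the axis argument
layer_means = data_array.mean(axis=(-2, -1))
layer_sds = data_array.mean(axis=-2).std(ddof=1, axis=-1)

# Manual: the question's per-layer formulas, applied layer by layer
manual_means = np.array([layer.mean() for layer in data_array])
manual_sds = np.array([layer.mean(axis=0).std(ddof=1) for layer in data_array])

# Both comparisons should agree if the axes were chosen correctly
print(np.allclose(layer_means, manual_means))
print(np.allclose(layer_sds, manual_sds))
```

Iterating over data_array yields one 2-D layer at a time, so axis=0 inside the loop corresponds to axis=-2 in the vectorized version.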

User response:

You could also use the pivot_wider function from pyjanitor, as below:

import numpy as np
import pandas as pd
import janitor  # pyjanitor; importing it registers pivot_wider on DataFrame

data_array = np.arange(200).reshape(4, 5, 10)

dat = {'Mean': data_array.mean(), 
 'MeanSd': data_array.mean(0).std(ddof = 1),
 'Means': data_array.mean((1,2)),
 'MeanSds': data_array.mean(1).std(1, ddof = 1),
 'name' : np.arange(data_array.shape[0]) + 1}

(pd.DataFrame(dat).
  pivot_wider(index = ('Mean', 'MeanSd'), names_from = 'name'))

The results:

   Mean    MeanSd  Means_1  Means_2  Means_3  Means_4  MeanSds_1  MeanSds_2  MeanSds_3  MeanSds_4
0  99.5  14.57738     24.5     74.5    124.5    174.5    3.02765    3.02765    3.02765    3.02765
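
If pyjanitor is not installed, roughly the same one-row layout can be built with plain NumPy and pandas. This is a sketch rather than a drop-in replacement for pivot_wider: the column names are assembled by hand to mimic the output above, and the names row and result are only illustrative.

```python
import numpy as np
import pandas as pd

data_array = np.arange(200).reshape(4, 5, 10)

# Per-layer statistics, one value per layer along axis 0
means = data_array.mean(axis=(1, 2))
sds = data_array.mean(axis=1).std(axis=1, ddof=1)

# Overall statistics first, then one column per layer (numbered from 1)
row = {"Mean": data_array.mean(),
       "MeanSd": data_array.mean(axis=0).std(ddof=1)}
row.update({f"Means_{i}": m for i, m in enumerate(means, start=1)})
row.update({f"MeanSds_{i}": s for i, s in enumerate(sds, start=1)})

result = pd.DataFrame(row, index=[0])
print(result)
```

Because the column names come from enumerate, the same code should work unchanged for an array with 100 layers.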