I have a NumPy N-d array with shape (2, 3, 2). I want to calculate the mean and standard deviation (sd) of the array over each set/layer and save them in a pandas DataFrame. I can do this using the following code:
import pandas as pd
import numpy as np
data_array = np.ndarray(shape=(2, 3, 2))  # placeholder array; np.ndarray leaves the values uninitialized
final_result = pd.DataFrame(
{
"Mean": np.array(data_array).mean(),
"Mean_sd": np.array(data_array).mean(axis=0).std(ddof=1),
"Mean_1": np.array(data_array[0]).mean(),
"Mean_1_sd": np.array(data_array[0]).mean(axis=0).std(ddof=1),
"Mean_2": np.array(data_array[1]).mean(),
"Mean_2_sd": np.array(data_array[1]).mean(axis=0).std(ddof=1),
},
index=[0],
)
In the given example I have only 2 layers/sets, so I took the layer/set number (i.e., [0] or [1]) manually to calculate the mean and sd:
"Mean_1": np.array(data_array[0]).mean(),
"Mean_2": np.array(data_array[1]).mean(),
But the real data_array is big (say, with shape (100, 3, 2)), so it is not practical (and not Pythonic) to write out the layer/set numbers manually.
Is there a way to compute these per-layer values dynamically, without listing the layer/set numbers by hand, and save them in the pandas DataFrame?
CodePudding user response:
You can use the axis argument to take the means and stds over the appropriate axes of your array, so you only need to write each calculation once. Then join the results into one big DataFrame (this can all be done within concat, but it is split out here for clarity).
import numpy as np
import pandas as pd
data_array = np.arange(200).reshape(4, 5, 10)
# Overall mean, and sd of the layer-averaged values (mean over axis 0)
df1 = pd.DataFrame({"Mean": data_array.mean(),
                    "Mean_sd": data_array.mean(axis=0).std(ddof=1)}, index=[0])
# Mean collapsing the last two axes
df2 = pd.DataFrame([data_array.mean(axis=(-2, -1))],
                   columns=[f'Mean_{i + 1}' for i in range(data_array.shape[0])])
# Sd (over the last axis) of the mean taken over the second-to-last axis
df3 = pd.DataFrame([data_array.mean(axis=-2).std(ddof=1, axis=-1)],
                   columns=[f'Mean_{i + 1}_sd' for i in range(data_array.shape[0])])
res = pd.concat([df1, df2, df3], axis=1)
print(res)
   Mean   Mean_sd  Mean_1  Mean_2  Mean_3  Mean_4  Mean_1_sd  Mean_2_sd  Mean_3_sd  Mean_4_sd
0  99.5  14.57738    24.5    74.5   124.5   174.5    3.02765    3.02765    3.02765    3.02765
As a check, this is what your final_result code would output for my input above. It has fewer columns because my input has four layers, and you would need to write out the columns for layers 3 and 4 manually.
   Mean   Mean_sd  Mean_1  Mean_1_sd  Mean_2  Mean_2_sd
0  99.5  14.57738    24.5    3.02765    74.5    3.02765
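The same approach can be wrapped in a small helper and checked against the manual per-layer formula from the question. This is only a sketch: the function name layer_stats and the random (100, 3, 2) test array are assumptions made for illustration and do not come from the original code.

```python
import numpy as np
import pandas as pd


def layer_stats(data_array):
    """Return a one-row DataFrame of overall and per-layer means and sds.

    Hypothetical helper; column names follow the question's convention.
    """
    data_array = np.asarray(data_array)
    n_layers = data_array.shape[0]

    stats = {
        "Mean": data_array.mean(),
        "Mean_sd": data_array.mean(axis=0).std(ddof=1),
    }
    # Per-layer mean: collapse the last two axes.
    layer_means = data_array.mean(axis=(-2, -1))
    # Per-layer sd: average over axis -2, then take the sd over the last axis.
    layer_sds = data_array.mean(axis=-2).std(ddof=1, axis=-1)

    for i in range(n_layers):
        stats[f"Mean_{i + 1}"] = layer_means[i]
        stats[f"Mean_{i + 1}_sd"] = layer_sds[i]

    return pd.DataFrame(stats, index=[0])


# Assumed test data with the shape mentioned in the question.
rng = np.random.default_rng(0)
big_array = rng.random((100, 3, 2))
result = layer_stats(big_array)

# Compare one layer against the manual formula from the question.
manual_sd = big_array[41].mean(axis=0).std(ddof=1)
assert np.isclose(result.loc[0, "Mean_42_sd"], manual_sd)
print(result.shape)  # (1, 202)
```

Building the dictionary in a loop keeps each layer's mean next to its sd, which matches the column order of the original final_result.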
CodePudding user response:
You could also use the pivot_wider function from pyjanitor, as below:
import numpy as np
import pandas as pd
import janitor
data_array = np.arange(200).reshape(4, 5, 10)
dat = {'Mean': data_array.mean(),
       'MeanSd': data_array.mean(0).std(ddof=1),
       'Means': data_array.mean((1, 2)),
       'MeanSds': data_array.mean(1).std(1, ddof=1),
       'name': np.arange(data_array.shape[0]) + 1}
(pd.DataFrame(dat).
pivot_wider(index = ('Mean', 'MeanSd'), names_from = 'name'))
The results:
   Mean    MeanSd  Means_1  Means_2  Means_3  Means_4  MeanSds_1  MeanSds_2  MeanSds_3  MeanSds_4
0  99.5  14.57738     24.5     74.5    124.5    174.5    3.02765    3.02765    3.02765    3.02765
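If installing pyjanitor is not an option, a similar reshape can probably be done with plain pandas. The sketch below first builds a long table with one row per layer (which may be easier to read with 100 layers) and then unstacks it into a single wide row. The names per_layer and wide, and the lowercase column labels, are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

data_array = np.arange(200).reshape(4, 5, 10)

# Long layout: one row per layer instead of one column per layer.
per_layer = pd.DataFrame(
    {
        "layer": np.arange(data_array.shape[0]) + 1,
        "mean": data_array.mean(axis=(1, 2)),
        "mean_sd": data_array.mean(axis=1).std(axis=1, ddof=1),
    }
)

# Wide layout: unstack to a Series keyed by (statistic, layer),
# then turn that Series into a single-row DataFrame.
wide = per_layer.set_index("layer").unstack().to_frame().T
wide.columns = [f"{stat}_{layer}" for stat, layer in wide.columns]

print(per_layer)
print(wide)
```

The long table is often the more convenient form for further filtering or plotting. The wide row is only needed if the one-row layout from the question is required.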