How to flatten a numpy array based on frequency to get the correct standard deviation?

I can very easily get the standard deviation of some numbers in a 1D list in numpy like below:

import numpy as np
arr1 = np.array([100, 100, 100, 200, 200, 500])
sd = np.std(arr1)
print(sd)

But my data is a 2D array in which the second value of each inner list is the frequency:

arr2 = np.array([[100, 3], [200, 2], [500, 1]])

How can I flatten it based on frequency (change arr2 into arr1), to get the correct standard deviation?

CodePudding user response：

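Use arr2[:, 0].repeat(arr2[:, 1]). This repeats each value in the first column as many times as the count in the second column, which turns arr2 into arr1.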
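
As a minimal sketch of that one-liner, assuming the arr2 from the question (the names values and counts are only illustrative), the flattening and the standard deviation could look like this:

```python
import numpy as np

arr2 = np.array([[100, 3], [200, 2], [500, 1]])

values = arr2[:, 0]  # first column: the data values
counts = arr2[:, 1]  # second column: how often each value occurs

# repeat() emits values[i] exactly counts[i] times, giving a flat 1D array
arr1 = values.repeat(counts)
print(arr1)

# population standard deviation (ddof=0), equivalent to np.std(arr1)
sd = arr1.std()
print(sd)
```

Note that repeat() expects integer counts, so if arr2 were stored as floats the counts column would probably need a cast such as counts.astype(int) first.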

CodePudding user response：

If you want the whole array flattened, you can use ravel():

arr2.ravel()
# output: array([100,   3, 200,   2, 500,   1])

If you want a specific column, you can select all rows and index the column:

arr2[:,1]
# output: array([3, 2, 1])
arr2[:,0]
# output: array([100, 200, 500])

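To get the standard deviation of any of these arrays, you can add .std() at the end. Note that none of the calls below takes the frequencies into account, so none of them reproduces np.std(arr1):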

sd = arr2.ravel().std()
# or
sd = arr2[:,0].std()
# or 
sd = arr2[:,1].std()
# etc
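
A quick check, assuming the arr2 from the question, suggests why these calls do not give the frequency-based result. The variable names here are illustrative only:

```python
import numpy as np

arr2 = np.array([[100, 3], [200, 2], [500, 1]])

# Reference: expand each value by its count, then take the population std
expected = np.repeat(arr2[:, 0], arr2[:, 1]).std()

# ravel() mixes values and counts into one array, so its std describes neither
mixed = arr2.ravel().std()

# Column 0 alone treats every value as if it occurred exactly once
unweighted = arr2[:, 0].std()

# Both comparisons should come out False for this data
print(np.isclose(mixed, expected), np.isclose(unweighted, expected))
```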

CodePudding user response：

While @timgeb's (good) answer is the most straightforward, repeating the values might not be efficient if you have very large frequencies, such as np.array([[100, 3000], [200, 20000], [500, 100]]), because the flattened array holds one element per occurrence.

In this case you can compute the standard deviation manually:

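v, r = arr2.T                                # v: values, r: frequencies
n = r.sum()                                  # total number of observations
avg = (v * r).sum() / n                      # frequency-weighted mean
std = np.sqrt((r * (v - avg)**2).sum() / n)  # population standard deviation (ddof=0)
print(std)
# output: 141.4213562373095 (for the arr2 from the question)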
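
If this is needed in more than one place, the same computation could be wrapped in a small helper. This is only a sketch: weighted_std is a hypothetical name, not a NumPy function, and the ddof parameter is an added assumption that mirrors np.std so the sample standard deviation is also available.

```python
import numpy as np

def weighted_std(values, counts, ddof=0):
    """Standard deviation of values weighted by frequency counts (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()                            # total number of observations
    mean = np.average(values, weights=counts)   # frequency-weighted mean
    var = (counts * (values - mean) ** 2).sum() / (n - ddof)
    return np.sqrt(var)

arr2 = np.array([[100, 3], [200, 2], [500, 1]])
v, r = arr2.T

pop = weighted_std(v, r)             # population std, comparable to np.std(arr1)
sample = weighted_std(v, r, ddof=1)  # sample std, comparable to np.std(arr1, ddof=1)
print(pop, sample)
```

Memory use stays proportional to the number of distinct values rather than the sum of the frequencies, which is the point of avoiding repeat() for large counts.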

Or use statsmodels:

from statsmodels.stats.weightstats import DescrStatsW

v,r = arr2.T
DescrStatsW(v, weights=r, ddof=0).std
# 141.4213562373095