The default value of the degrees-of-freedom argument ddof in np.std is ddof=0. When np.std is passed to pd.DataFrame.transform, this behavior changes:
import numpy as np
import pandas as pd
df = pd.DataFrame({"col1": [1, 2, 3, 4, 5, 9],
"group": ["a", "a", "a", "b", "b", "b"]})
std = df.groupby("group")["col1"].transform(np.std)
The output is
0 1.000000
1 1.000000
2 1.000000
3 2.645751
4 2.645751
5 2.645751
Name: col1, dtype: float64
Meanwhile, np.std([1, 2, 3]) = 0.816496580927726, while np.std([1, 2, 3], ddof=1) = 1.0.
It appears that a different std(), or the same function with different options, is used inside pd.DataFrame.transform. How can this be fixed?
CodePudding user response:
The easy fix is to pass ddof=0 explicitly through pd.DataFrame.transform:
import numpy as np
import pandas as pd
df = pd.DataFrame({"col1": [1, 2, 3, 4, 5, 9],
"group": ["a", "a", "a", "b", "b", "b"]})
std = df.groupby("group")["col1"].transform(np.std, ddof=0) # <- fix
The output then is:
0 0.816497
1 0.816497
2 0.816497
3 2.160247
4 2.160247
5 2.160247
Name: col1, dtype: float64
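The reason for the discrepancy, consistent with the outputs above, is that pandas recognizes np.std and dispatches it to its own std aggregation, which defaults to ddof=1. A minimal sketch confirming the two calls are equivalent:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3, 4, 5, 9],
                   "group": ["a", "a", "a", "b", "b", "b"]})

# pandas maps np.std to its own 'std' aggregation (ddof=1),
# so these two calls produce identical results:
via_numpy = df.groupby("group")["col1"].transform(np.std)
via_pandas = df.groupby("group")["col1"].transform("std")
print(via_numpy.equals(via_pandas))  # True
```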
See these similar answers for pd.DataFrame.apply: "Different np.std behaviours in pd.apply" and "Pandas agg function gives different results for numpy std vs nanstd".
CodePudding user response:
Simplest is to use pandas Series.std, whose default is ddof=1:
std = df.groupby("group")["col1"].transform('std')
print (std)
0 1.000000
1 1.000000
2 1.000000
3 2.645751
4 2.645751
5 2.645751
Name: col1, dtype: float64
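As another option (a sketch not taken from the answers above): wrapping the call in a lambda bypasses pandas' internal dispatch, so numpy's own default of ddof=0 applies again:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3, 4, 5, 9],
                   "group": ["a", "a", "a", "b", "b", "b"]})

# A lambda is not recognized by pandas' dispatch table, so np.std
# runs with its default ddof=0 on each group's values:
std = df.groupby("group")["col1"].transform(lambda s: np.std(s.to_numpy()))
print(std.round(6).tolist())  # [0.816497, 0.816497, 0.816497, 2.160247, 2.160247, 2.160247]
```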