Different np.std behaviours in pd.DataFrame.transform

Time:11-24

The default option of the degrees of freedom ddof in np.std is ddof=0.

When np.std is passed to transform on a pandas groupby, the result corresponds to ddof=1 instead:

import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3, 4, 5, 9],
                   "group": ["a", "a", "a", "b", "b", "b"]})
std = df.groupby("group")["col1"].transform(np.std)

The output is

0    1.000000
1    1.000000
2    1.000000
3    2.645751
4    2.645751
5    2.645751
Name: col1, dtype: float64

Meanwhile, np.std([1, 2, 3]) = 0.816496580927726 and np.std([1, 2, 3], ddof=1) = 1.0.
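
To make the two numbers concrete, here is a minimal sketch that computes the standard deviation by hand for both values of ddof. The helper name std_with_ddof is hypothetical and exists only for illustration; the divisor N - ddof is how np.std documents the parameter.

```python
import math
import numpy as np

def std_with_ddof(values, ddof=0):
    """Hypothetical helper: standard deviation with divisor N - ddof."""
    n = len(values)
    mean = sum(values) / n
    squared_dev = sum((v - mean) ** 2 for v in values)
    return math.sqrt(squared_dev / (n - ddof))

data = [1, 2, 3]
population = std_with_ddof(data, ddof=0)  # divisor N, NumPy's default
sample = std_with_ddof(data, ddof=1)      # divisor N - 1, pandas' default

# Both agree with NumPy when the same ddof is passed explicitly.
assert math.isclose(population, np.std(data, ddof=0))
assert math.isclose(sample, np.std(data, ddof=1))
```

For [1, 2, 3] the squared deviations sum to 2, so the two results are sqrt(2/3) and sqrt(2/2), which matches the values quoted above.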

It seems that a different std() implementation, or the same one with different options, is used when np.std is passed to pd.DataFrame.transform.

How can this be fixed?

CodePudding user response:

The easy fix is to pass ddof=0 as a keyword argument to transform, which forwards it to the standard deviation function:

import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3, 4, 5, 9],
                   "group": ["a", "a", "a", "b", "b", "b"]})
std = df.groupby("group")["col1"].transform(np.std, ddof=0)  # <- fix

The output then is:

0    0.816497
1    0.816497
2    0.816497
3    2.160247
4    2.160247
5    2.160247
Name: col1, dtype: float64

Similar answers exist for pd.DataFrame.apply and agg; see the questions "Different np.std behaviours in pd.apply" and "Pandas agg function gives different results for numpy std vs nanstd".
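
If you would rather not depend on how pandas treats a NumPy callable passed to transform (this may differ between pandas versions, which is an assumption here rather than something stated in the question), one option is to spell out ddof inside a lambda. The sketch below reuses the question's data; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3, 4, 5, 9],
                   "group": ["a", "a", "a", "b", "b", "b"]})
grouped = df.groupby("group")["col1"]

# Population standard deviation (divisor N), stated explicitly via pandas.
population_std = grouped.transform(lambda s: s.std(ddof=0))

# Same quantity computed by NumPy on the raw values of each group.
numpy_std = grouped.transform(lambda s: np.std(s.to_numpy()))

# Sample standard deviation (divisor N - 1), also stated explicitly.
sample_std = grouped.transform(lambda s: s.std(ddof=1))
```

A lambda is usually slower than the built-in group reductions on large frames, so this trades some speed for an unambiguous ddof.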

CodePudding user response:

The simplest option is pandas' own Series.std via the string alias 'std'; note that its default is ddof=1, not ddof=0:

std = df.groupby("group")["col1"].transform('std')
print(std)
0    1.000000
1    1.000000
2    1.000000
3    2.645751
4    2.645751
5    2.645751
Name: col1, dtype: float64
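
As a further sketch under the same data, the per-group standard deviation can also be aggregated once with an explicit ddof and then mapped back onto the rows. This shows that the string alias gives the ddof=1 values, while ddof=0 has to be requested explicitly. The names alias_std, per_group and mapped_std are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3, 4, 5, 9],
                   "group": ["a", "a", "a", "b", "b", "b"]})

# The string alias resolves to pandas' own std, which defaults to ddof=1.
alias_std = df.groupby("group")["col1"].transform("std")

# Aggregate once per group with an explicit ddof, then map back onto rows.
per_group = df.groupby("group")["col1"].std(ddof=0)
mapped_std = df["group"].map(per_group)
```

Here mapped_std holds the population values (about 0.816497 for group a and 2.160247 for group b), while alias_std holds the sample values shown above.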