As discussed here, pandas silently skips NaN values when calculating sums, so an all-NaN column sums to 0, in contrast to explicit arithmetic as shown here:
import pandas as pd
import numpy as np
np.nan + np.nan  # Result: nan
pd.DataFrame([np.nan, np.nan]).sum().item()  # Result: 0.0
pandas' descriptive statistics methods have a skipna argument. However, skipna is True by default, which masks the presence of missing values from casual users and novice programmers.
This creates a risk that analyses will be "...quietly, accidentally wrong since their Pandas operators haven't used the correct skipna".
In Python, is there a way for users to set skipna=False as the default option?
CodePudding user response:
This is the documented behavior:
skipna (bool, default True) - Exclude NA/null values when computing the result.
The skipna parameter of the pd.DataFrame.sum() method defaults to True. So when you take a column sum, it skips the NaN values; with nothing left to add, the sum is 0.
If you set it to False, you see the intended behavior. However, there is no built-in option to make False the default. You have to pass skipna=False on each call, unless you define your own wrapper around the method.
import pandas as pd
import numpy as np
np.nan + np.nan  # nan
pd.DataFrame([np.nan, np.nan]).sum(skipna=False)
0   NaN
dtype: float64
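To see that the default skips missing values rather than turning them into zeros, it may help to try data that is only partly missing. This is a minimal sketch; the Series `s` and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one real value and one missing value
s = pd.Series([1.0, np.nan])

default_sum = s.sum()                # NaN is skipped, only 1.0 is summed
strict_sum = s.sum(skipna=False)     # NaN propagates into the result
default_mean = s.mean()              # mean over the single non-missing value
strict_mean = s.mean(skipna=False)   # NaN propagates here too

print(default_sum, strict_sum, default_mean, strict_mean)
```

With the default, the mean is taken over one value, not two, which is why the missing entry goes unnoticed in both results.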
Here is a wrapper that can be defined to set your parameters to a custom value globally. This is code from this excellent SO answer.
## Function from -
## https://stackoverflow.com/questions/55877832/setting-pandas-global-default-for-skipna-to-false
def set_default(func, **default):
    def inner(*args, **kwargs):
        kwargs.update(default)        # Update function kwargs w/ decorator defaults
        return func(*args, **kwargs)  # Call function w/ updated kwargs
    return inner                      # Return decorated function
pd.DataFrame.sum = set_default(pd.DataFrame.sum, skipna=False)
pd.DataFrame([np.nan, np.nan]).sum()
0   NaN
dtype: float64
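One caveat with the wrapper above: because it calls kwargs.update(default), it also overrides a skipna that the caller passes explicitly, so sum(skipna=True) would still return NaN. A possible variant, sketched below, uses setdefault so that explicit keyword arguments win. The names `with_default` and `REDUCTIONS` are hypothetical, the list of methods is just an illustrative subset, and a skipna passed positionally is not handled.

```python
import functools

import numpy as np
import pandas as pd


def with_default(func, **defaults):
    """Wrap func so that `defaults` apply only when the caller omits them."""
    @functools.wraps(func)
    def inner(*args, **kwargs):
        for name, value in defaults.items():
            kwargs.setdefault(name, value)  # explicit keyword arguments win
        return func(*args, **kwargs)
    return inner


# Illustrative subset of reductions that accept skipna (an assumption,
# extend it only with methods you have checked)
REDUCTIONS = ("sum", "mean")

for cls in (pd.DataFrame, pd.Series):
    for name in REDUCTIONS:
        setattr(cls, name, with_default(getattr(cls, name), skipna=False))

s = pd.Series([1.0, np.nan])
patched_default = s.sum()            # NaN: skipna now defaults to False
explicit_skip = s.sum(skipna=True)   # 1.0: the explicit argument is respected
```

Keep in mind that this patches pandas for the whole process, so any other code that relies on the default skipna=True will see the new behavior as well.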