Why do Pandas and NumPy return different results for some basic functions such as the median?
Pandas automatically omits NaN values; NumPy does not.
import numpy as np
import pandas as pd
np.random.seed(10)
df = pd.DataFrame(np.random.randint(0, 10, size=10), columns=['x'])
df.loc[df.x > 1, 'x'] = np.nan
print(df)
# x
#0 NaN
#1 NaN
#2 0.0
#3 1.0
#4 NaN
#5 0.0
#6 1.0
#7 NaN
#8 NaN
#9 0.0
print(df['x'].median())
#0.0
print(np.median(df['x']))
#nan
CodePudding user response:
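They are two different libraries with different defaults. Pandas reductions such as median() skip NaN by default (skipna=True), while np.median() propagates NaN, so a single NaN in the input makes the result NaN.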
If you want to ignore the NaN:
np.nanmedian(df['x'])
df['x'].median()
If you want to have a NaN result:
np.median(df['x'])
df['x'].median(skipna=False)
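As a quick check, here is a minimal sketch that puts the four calls side by side. The Series `s` and its values are made up for illustration and are not the data from the question. It is converted with `to_numpy()` so the NumPy functions see a plain array.

```python
import numpy as np
import pandas as pd

# Hypothetical data: four valid values and one missing value.
s = pd.Series([0.0, 1.0, np.nan, 0.0, 1.0])
arr = s.to_numpy()

# Ignore NaN: median of [0, 0, 1, 1]
skip_pd = s.median()            # pandas default is skipna=True
skip_np = np.nanmedian(arr)     # NumPy's NaN-ignoring variant

# Propagate NaN: any NaN in the input gives a NaN result
keep_pd = s.median(skipna=False)
keep_np = np.median(arr)

print(skip_pd, skip_np)   # 0.5 0.5
print(keep_pd, keep_np)   # nan nan
```

Both libraries support both behaviours, so the only real difference is which one you get when you do not ask for either.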