I noticed that correlation calculations return different values when using pandas vs numpy.
This is my sample data:
import numpy as np
import pandas as pd
import os
df = pd.DataFrame(
    {
        "name": ["a", "b", "c", "d", "e", "f"],
        "type": [float, float, float, float, float, float],
        "value": [2.121, np.nan, 21.131, 30.4242, 100.424, 22.4341],
        "obs": [44, 55, 22, 77, 88, 33],
        "num": [66, 23, 62, 63, 23, 12],
    }
)
Correlation calculations:
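# On pandas 2.0+, non-numeric columns are no longer dropped silently, so numeric_only=True is needed there.
pandas_corr = df.corr(numeric_only=True)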
numeric_only_df = df.select_dtypes("number").dropna()
numpy_corr = pd.DataFrame(
    np.corrcoef(numeric_only_df, rowvar=False),
    columns=numeric_only_df.columns,
    index=numeric_only_df.columns,
)
Results:
Using pandas:
          value       obs       num
value  1.000000  0.732365 -0.524068
obs    0.732365  1.000000 -0.138357
num   -0.524068 -0.138357  1.000000
Using numpy:
          value       obs       num
value  1.000000  0.732365 -0.524068
obs    0.732365  1.000000 -0.134928
num   -0.524068 -0.134928  1.000000
Some values are the same but others differ. Does anyone know why, and what would cause only some of them to differ?
CodePudding user response:
The only difference is in the correlation coefficient between obs and num. The reason is that your result using pandas uses the row with index 1 while your result using numpy does not.
>>> df
  name             type     value  obs  num
0    a  <class 'float'>    2.1210   44   66
1    b  <class 'float'>       NaN   55   23
2    c  <class 'float'>   21.1310   22   62
3    d  <class 'float'>   30.4242   77   63
4    e  <class 'float'>  100.4240   88   23
5    f  <class 'float'>   22.4341   33   12
>>> numeric_only_df
      value  obs  num
0    2.1210   44   66
2   21.1310   22   62
3   30.4242   77   63
4  100.4240   88   23
5   22.4341   33   12
Note that for the correlation between obs and num, said row contains relevant information. It is only irrelevant for any correlation where one variable is value. Consequently, pandas is given more information than numpy, and pandas will not ignore this information.
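To make the effect of that one row concrete, here is a minimal sketch that computes only the obs/num coefficient twice with np.corrcoef: once on all six rows and once after dropping every row that contains a NaN. The variable names r_all_rows and r_complete_rows are mine, and the frame is rebuilt with just the three numeric columns from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "value": [2.121, np.nan, 21.131, 30.4242, 100.424, 22.4341],
    "obs": [44, 55, 22, 77, 88, 33],
    "num": [66, 23, 62, 63, 23, 12],
})

# All six rows: obs and num have no missing values, so nothing has to be dropped.
r_all_rows = np.corrcoef(df["obs"], df["num"])[0, 1]

# Five rows: dropna() removes row 1 because value is NaN there,
# even though obs and num are both present in that row.
complete = df.dropna()
r_complete_rows = np.corrcoef(complete["obs"], complete["num"])[0, 1]

print(round(r_all_rows, 6))       # should agree with the pandas value in the question
print(round(r_complete_rows, 6))  # should agree with the numpy value in the question
```

The first number should line up with the -0.138357 reported by pandas and the second with the -0.134928 reported by numpy, which suggests the gap comes from the choice of rows and not from the formula.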
You can do two things to verify my statement.
- First, have a look at the source code of DataFrame.corr to verify that pandas does not unnecessarily drop any data: each coefficient is computed from every row in which both columns are non-missing, so on data without missing values the result must be identical to that of np.corrcoef.
- Second, you can compute the correlation coefficient using both methods on the data without said row, i.e., df.iloc[df.index!=1,-3:], to see that they both produce the same result:
>>> df.iloc[df.index!=1,-3:].corr()
          value       obs       num
value  1.000000  0.732365 -0.524068
obs    0.732365  1.000000 -0.134928
num   -0.524068 -0.134928  1.000000
>>> np.corrcoef(df.iloc[df.index!=1,-3:], rowvar=False)
array([[ 1. , 0.73236541, -0.52406831],
[ 0.73236541, 1. , -0.13492801],
[-0.52406831, -0.13492801, 1. ]])
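If you would rather have numpy reproduce the pandas result on the original data, one option is to drop missing values per pair of columns instead of per row. The following is only a sketch: pairwise_corr is a hypothetical helper of mine, not a pandas or numpy function, and it assumes every column is non-constant and every pair shares at least two complete rows.

```python
import numpy as np
import pandas as pd


def pairwise_corr(frame):
    """Pearson correlation where each pair uses only rows in which both columns are present."""
    cols = list(frame.columns)
    # Start from the identity matrix: a non-constant column correlates perfectly with itself.
    out = pd.DataFrame(np.eye(len(cols)), index=cols, columns=cols)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            pair = frame[[a, b]].dropna()  # drop rows missing in a or b only
            r = np.corrcoef(pair[a], pair[b])[0, 1]
            out.loc[a, b] = out.loc[b, a] = r
    return out


df = pd.DataFrame({
    "value": [2.121, np.nan, 21.131, 30.4242, 100.424, 22.4341],
    "obs": [44, 55, 22, 77, 88, 33],
    "num": [66, 23, 62, 63, 23, 12],
})

numpy_pairwise = pairwise_corr(df)
print(numpy_pairwise)
print(np.allclose(numpy_pairwise, df.corr()))
```

Which behaviour you want depends on the analysis: dropping whole rows keeps every coefficient based on the same observations, while dropping per pair uses as much data as possible for each coefficient.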