pd.corr and np.corrcoef return different results

I noticed that correlation calculations return different values when using pandas vs numpy.

This is my sample data:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"name": ["a", "b", "c", "d", "e", "f"],
     "type": [float, float, float, float, float, float],
     "value": [2.121, np.nan, 21.131, 30.4242, 100.424, 22.4341],
     "obs": [44, 55, 22, 77, 88, 33],
     "num": [66, 23, 62, 63, 23, 12]}
)

Correlation calculations:

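# Select numeric columns explicitly: pandas 2.0+ raises on df.corr() when the frame has non-numeric columns.
pandas_corr = df.select_dtypes("number").corr()
numeric_only_df = df.select_dtypes("number").dropna()
numpy_corr = pd.DataFrame(
    np.corrcoef(numeric_only_df, rowvar=False),
    columns=numeric_only_df.columns, index=numeric_only_df.columns)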

Results:

Using pandas:

          value       obs       num
value  1.000000  0.732365 -0.524068
obs    0.732365  1.000000 -0.138357
num   -0.524068 -0.138357  1.000000

Using numpy:

          value       obs       num
value  1.000000  0.732365 -0.524068
obs    0.732365  1.000000 -0.134928
num   -0.524068 -0.134928  1.000000

Some values are the same but some differ. Does anyone know why, and what would cause only some of them to differ?

CodePudding user response:

The only difference is the correlation coefficient between obs and num. The reason is that the pandas result uses the row with index 1, while the numpy result does not: dropna() removed that row from numeric_only_df because its value entry is NaN.

>>> df
  name             type     value  obs  num
0    a  <class 'float'>    2.1210   44   66
1    b  <class 'float'>       NaN   55   23
2    c  <class 'float'>   21.1310   22   62
3    d  <class 'float'>   30.4242   77   63
4    e  <class 'float'>  100.4240   88   23
5    f  <class 'float'>   22.4341   33   12
>>> numeric_only_df
      value  obs  num
0    2.1210   44   66
2   21.1310   22   62
3   30.4242   77   63
4  100.4240   88   23
5   22.4341   33   12
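
To see how many rows each coefficient can draw on, one can count the rows in which both columns of a pair are present. This is a small sketch that rebuilds only the numeric columns from the question; the names present and pair_counts are illustrative and not part of the original code.

```python
import numpy as np
import pandas as pd

# Numeric columns from the question's sample data.
df = pd.DataFrame({
    "value": [2.121, np.nan, 21.131, 30.4242, 100.424, 22.4341],
    "obs": [44, 55, 22, 77, 88, 33],
    "num": [66, 23, 62, 63, 23, 12],
})

# 1 where a cell is present, 0 where it is NaN.
present = df.notna().astype(int)

# Entry (a, b) counts the rows where both column a and column b are present.
pair_counts = present.T @ present
print(pair_counts)
```

Every pair that involves value has 5 usable rows, while obs and num share all 6.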

Note that for the correlation between obs and num, row 1 contains relevant information; it is irrelevant only for correlations in which one variable is value. pandas is therefore given more information than numpy, and it does not ignore it: df.corr() excludes missing values pair by pair rather than dropping whole rows.
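
The pairwise behaviour can be imitated by hand: for each pair of columns, drop only the rows where one of those two columns is missing, then call np.corrcoef on what remains. The following is a rough sketch of that idea, not pandas' actual implementation; pairwise_corr is a hypothetical helper written for this illustration.

```python
import itertools

import numpy as np
import pandas as pd

# Numeric columns from the question's sample data.
df = pd.DataFrame({
    "value": [2.121, np.nan, 21.131, 30.4242, 100.424, 22.4341],
    "obs": [44, 55, 22, 77, 88, 33],
    "num": [66, 23, 62, 63, 23, 12],
})


def pairwise_corr(frame):
    """Pearson correlation per column pair, dropping NaN rows per pair only."""
    cols = frame.columns
    out = pd.DataFrame(np.eye(len(cols)), index=cols, columns=cols)
    for a, b in itertools.combinations(cols, 2):
        pair = frame[[a, b]].dropna()  # rows complete for this pair only
        r = np.corrcoef(pair[a], pair[b])[0, 1]
        out.loc[a, b] = r
        out.loc[b, a] = r
    return out


manual = pairwise_corr(df)
print(manual.round(6))
```

For obs and num this keeps all six rows, which is why the coefficient matches the pandas result (-0.138357) and not the numpy one.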

You can do two things to verify my statement.

First, have a look at the documentation or source code of DataFrame.corr: it computes pairwise correlations of the columns and excludes NA/null values per pair, so it does not drop any data unnecessarily. On data without missing values it returns the same Pearson coefficients as np.corrcoef.
Second, you can compute the correlation coefficients using both methods on the data without row 1, i.e., df.iloc[df.index!=1,-3:], and see that they produce the same result:

>>> df.iloc[df.index!=1,-3:].corr()
          value       obs       num
value  1.000000  0.732365 -0.524068
obs    0.732365  1.000000 -0.134928
num   -0.524068 -0.134928  1.000000
>>> np.corrcoef(df.iloc[df.index!=1,-3:], rowvar=False)
array([[ 1.        ,  0.73236541, -0.52406831],
       [ 0.73236541,  1.        , -0.13492801],
       [-0.52406831, -0.13492801,  1.        ]])
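
If you would rather not hard-code the row index, the same check can be written with dropna(), which removes every row that has a missing value in any column before either library sees the data. This is a minimal sketch under that assumption; the names complete, pandas_listwise and numpy_listwise are illustrative.

```python
import numpy as np
import pandas as pd

# Numeric columns from the question's sample data.
df = pd.DataFrame({
    "value": [2.121, np.nan, 21.131, 30.4242, 100.424, 22.4341],
    "obs": [44, 55, 22, 77, 88, 33],
    "num": [66, 23, 62, 63, 23, 12],
})

# Keep only rows that are complete across all columns.
complete = df.dropna()

pandas_listwise = complete.corr()
numpy_listwise = np.corrcoef(complete, rowvar=False)

# Both libraries now work on the same five rows.
print(np.allclose(pandas_listwise.values, numpy_listwise))
```

Which variant is appropriate depends on the analysis: dropping incomplete rows up front makes all coefficients refer to the same rows, while the pandas default uses as many rows as each pair allows.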