pd.corr and np.corrcoef return different results

Time:01-24

I noticed that correlation calculations return different values when using pandas vs numpy.

This is my sample data:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"name": ["a", "b", "c", "d", "e", "f"],
     "type": [float, float, float, float, float, float],
     "value": [2.121,np.nan,21.131,30.4242,100.424, 22.4341],
     "obs": [44, 55, 22, 77, 88, 33],
     "num": [66, 23, 62, 63, 23, 12]}
)

Correlation calculations:

numeric_df = df.select_dtypes("number")  # corr() in pandas 2.x raises on non-numeric columns
pandas_corr = numeric_df.corr()
numeric_only_df = numeric_df.dropna()
numpy_corr = pd.DataFrame(np.corrcoef(numeric_only_df, rowvar=False),
                          columns=numeric_only_df.columns, index=numeric_only_df.columns)

Results using pandas:

          value       obs       num
value  1.000000  0.732365 -0.524068
obs    0.732365  1.000000 -0.138357
num   -0.524068 -0.138357  1.000000

Using numpy:

          value       obs       num
value  1.000000  0.732365 -0.524068
obs    0.732365  1.000000 -0.134928
num   -0.524068 -0.134928  1.000000

Some values are the same but others differ. Does anyone know what causes the difference?

Answer:

The only difference is in the correlation coefficient between obs and num. The reason is that the pandas result uses the row with index 1, while the numpy result does not, because dropna() removed that row before np.corrcoef was called.

>>> df
  name             type     value  obs  num
0    a  <class 'float'>    2.1210   44   66
1    b  <class 'float'>       NaN   55   23
2    c  <class 'float'>   21.1310   22   62
3    d  <class 'float'>   30.4242   77   63
4    e  <class 'float'>  100.4240   88   23
5    f  <class 'float'>   22.4341   33   12
>>> numeric_only_df
      value  obs  num
0    2.1210   44   66
2   21.1310   22   62
3   30.4242   77   63
4  100.4240   88   23
5   22.4341   33   12

Note that row 1 has valid values for both obs and num, so it contributes to the correlation between those two columns. It is unusable only for correlations that involve value, which is NaN in that row. Consequently, pandas is given more information than numpy, and pandas does not discard it.
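To make the effect of that single row concrete, here is a minimal sketch that rebuilds only the three numeric columns from the question and computes the obs/num coefficient both ways. The variable names pairwise and listwise are just labels chosen for this illustration.

```python
import numpy as np
import pandas as pd

# Same numeric columns as in the question.
df = pd.DataFrame({
    "value": [2.121, np.nan, 21.131, 30.4242, 100.424, 22.4341],
    "obs": [44, 55, 22, 77, 88, 33],
    "num": [66, 23, 62, 63, 23, 12],
})

# All six rows: obs and num are both present everywhere,
# so nothing is excluded for this pair.
pairwise = df["obs"].corr(df["num"])

# Five rows: dropna() removes row 1 because value is NaN there,
# even though obs and num are fine in that row.
complete_rows = df.dropna()
listwise = complete_rows["obs"].corr(complete_rows["num"])

print(round(pairwise, 6))
print(round(listwise, 6))
```

The first number should match the pandas table from the question (-0.138357) and the second the numpy table (-0.134928).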

You can verify this in two ways.

  • First, have a look at the source code of DataFrame.corr to verify that pandas does not drop whole rows: it excludes missing values pair by pair, so each coefficient is computed from every row in which both columns are present. On data without missing values it yields the same Pearson coefficients as np.corrcoef.
  • Second, you can compute the correlation coefficients using both methods on the data without that row, i.e., df.iloc[df.index!=1,-3:], to see that they both produce the same result:
>>> df.iloc[df.index!=1,-3:].corr()
          value       obs       num
value  1.000000  0.732365 -0.524068
obs    0.732365  1.000000 -0.134928
num   -0.524068 -0.134928  1.000000
>>> np.corrcoef(df.iloc[df.index!=1,-3:], rowvar=False)
array([[ 1.        ,  0.73236541, -0.52406831],
       [ 0.73236541,  1.        , -0.13492801],
       [-0.52406831, -0.13492801,  1.        ]])
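The check above makes pandas match numpy by removing the row. Going the other way, the following sketch makes numpy match pandas by dropping missing values separately for each pair of columns. The helper pairwise_corrcoef is hypothetical, written only for this illustration, and is not part of either library.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "value": [2.121, np.nan, 21.131, 30.4242, 100.424, 22.4341],
    "obs": [44, 55, 22, 77, 88, 33],
    "num": [66, 23, 62, 63, 23, 12],
})

def pairwise_corrcoef(frame):
    """Hypothetical helper: np.corrcoef applied per column pair,
    using only the rows where both columns are non-null."""
    cols = frame.columns
    # Start from the identity matrix: each column correlates 1.0 with itself.
    out = pd.DataFrame(np.eye(len(cols)), index=cols, columns=cols)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            both = frame[[a, b]].dropna()  # rows valid for this pair only
            r = np.corrcoef(both[a], both[b])[0, 1]
            out.loc[a, b] = r
            out.loc[b, a] = r
    return out

result = pairwise_corrcoef(df)
matches_pandas = np.allclose(result.to_numpy(), df.corr().to_numpy())
print(result)
print(matches_pandas)
```

If the explanation above is right, matches_pandas should be True, and the obs/num entry should be the -0.138357 that pandas reported in the question.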