Home > Back-end >  strange behavior of pandas df.corrwith using a pivot table
strange behavior of pandas df.corrwith using a pivot table

Time:06-22

I'm using Pandas version 1.4.1. Consider the following code, which works correctly, in which I compute the correlation of two dataframes using df.corrwith

# this works
aa  = np.arange(6,dtype='float64')
bb  = np.arange(6,dtype='float64')

dd1  = {'rating':aa}
dd2  = {'rating':bb}

df_1 = pd.DataFrame(dd1)
df_2 = pd.DataFrame(dd2)

print('output of corrwith = \n', df_1.corrwith(df_2,axis=0))
print('df_1.shape = ', df_1.shape)
print('df_2.shape = ', df_2.shape)
print('-'*80)

The output is, as expected,

output of corrwith = 
 rating    1.0
dtype: float64
df_1.shape =  (6, 1)
df_2.shape =  (6, 1)

Now, consider the following code using pivot tables.

import pandas as pd
import numpy as np
# this does not work
# I'm working with dataframes of the same shape as before
# Only difference is that they're coming from pivot tables
dd3 = {'user':  [1,2,1,2,3,4,5,1,2,6],
       'movie': [1,2,1,3,4,5,6,1,3,2],
       'rating':[1,2,1,5,4,3,3,2,1,5]
       }
df_3 = pd.DataFrame(dd3)

df_pivot = pd.pivot_table(df_3,index='user',columns='movie',values='rating').fillna(0)

print(df_pivot)
print('-'*80)

df_4 = df_pivot.iloc[:,[1]]
df_5 = df_pivot.iloc[:,[2]]

print('output of corrwith = \n', df_4.corrwith(df_5,axis=0))
print('df_4.shape = ', df_4.shape)
print('df_5.shape = ', df_5.shape)
print('-'*80)

The output is

output of corrwith = 
 movie
2   NaN
3   NaN
dtype: float64
df_4.shape =  (6, 1)
df_5.shape =  (6, 1)

So, my question is: Why does df.corrwith produce two NaNs in the second case but only one value output (1.0) in the first? And why is it producing NaNs - if I do the correlation manually, it produces 0.2.

CodePudding user response:

Explanation

The thing going wrong here is that the column names of both dataframes need to be the same. The following does not work (note the change in column name):

aa  = np.arange(6,dtype='float64')
bb  = np.arange(6,dtype='float64')

dd1  = {'rating':aa}
dd2  = {'ratings':bb}

df_1 = pd.DataFrame(dd1)
df_2 = pd.DataFrame(dd2)

print('output of corrwith = \n', df_1.corrwith(df_2,axis=0))

Output

output of corrwith = 
 rating    NaN
ratings   NaN
dtype: float64

Solution

You can use df_pivot.corrwith(df_pivot[2]) to get the correlation between every column and movie 2.

Output:

movie
1   -0.28
2    1.00
3    0.20
4   -0.28
5   -0.28
6   -0.28
dtype: float64
  • Related