strange behavior of pandas df.corrwith using a pivot table-CodePudding

I'm using Pandas version 1.4.1. Consider the following code, which works correctly, in which I compute the correlation of two dataframes using df.corrwith

# this works
aa  = np.arange(6,dtype='float64')
bb  = np.arange(6,dtype='float64')

dd1  = {'rating':aa}
dd2  = {'rating':bb}

df_1 = pd.DataFrame(dd1)
df_2 = pd.DataFrame(dd2)

print('output of corrwith = \n', df_1.corrwith(df_2,axis=0))
print('df_1.shape = ', df_1.shape)
print('df_2.shape = ', df_2.shape)
print('-'*80)

The output is, as expected,

output of corrwith = 
 rating    1.0
dtype: float64
df_1.shape =  (6, 1)
df_2.shape =  (6, 1)

Now, consider the following code using pivot tables.

import pandas as pd
import numpy as np
# this does not work
# I'm working with dataframes of the same shape as before
# Only difference is that they're coming from pivot tables
dd3 = {'user':  [1,2,1,2,3,4,5,1,2,6],
       'movie': [1,2,1,3,4,5,6,1,3,2],
       'rating':[1,2,1,5,4,3,3,2,1,5]
       }
df_3 = pd.DataFrame(dd3)

df_pivot = pd.pivot_table(df_3,index='user',columns='movie',values='rating').fillna(0)

print(df_pivot)
print('-'*80)

df_4 = df_pivot.iloc[:,[1]]
df_5 = df_pivot.iloc[:,[2]]

print('output of corrwith = \n', df_4.corrwith(df_5,axis=0))
print('df_4.shape = ', df_4.shape)
print('df_5.shape = ', df_5.shape)
print('-'*80)

The output is

output of corrwith = 
 movie
2   NaN
3   NaN
dtype: float64
df_4.shape =  (6, 1)
df_5.shape =  (6, 1)

So, my question is: Why does df.corrwith produce two NaNs in the second case but only one value output (1.0) in the first? And why is it producing NaNs - if I do the correlation manually, it produces 0.2.

CodePudding user response：

Explanation

The thing going wrong here is that the column names of both dataframes need to be the same. The following does not work (note the change in column name):

aa  = np.arange(6,dtype='float64')
bb  = np.arange(6,dtype='float64')

dd1  = {'rating':aa}
dd2  = {'ratings':bb}

df_1 = pd.DataFrame(dd1)
df_2 = pd.DataFrame(dd2)

print('output of corrwith = \n', df_1.corrwith(df_2,axis=0))

Output

output of corrwith = 
 rating    NaN
ratings   NaN
dtype: float64

Solution

You can use df_pivot.corrwith(df_pivot[2]) to get the correlation between every column and movie 2.

Output:

movie
1   -0.28
2    1.00
3    0.20
4   -0.28
5   -0.28
6   -0.28
dtype: float64