I'm using Pandas version 1.4.1. Consider the following code, which works correctly, in which I compute the correlation of two dataframes using df.corrwith
# this works
aa = np.arange(6,dtype='float64')
bb = np.arange(6,dtype='float64')
dd1 = {'rating':aa}
dd2 = {'rating':bb}
df_1 = pd.DataFrame(dd1)
df_2 = pd.DataFrame(dd2)
print('output of corrwith = \n', df_1.corrwith(df_2,axis=0))
print('df_1.shape = ', df_1.shape)
print('df_2.shape = ', df_2.shape)
print('-'*80)
The output is, as expected,
output of corrwith =
rating 1.0
dtype: float64
df_1.shape = (6, 1)
df_2.shape = (6, 1)
Now, consider the following code using pivot tables.
import pandas as pd
import numpy as np
# this does not work
# I'm working with dataframes of the same shape as before
# Only difference is that they're coming from pivot tables
dd3 = {'user': [1,2,1,2,3,4,5,1,2,6],
'movie': [1,2,1,3,4,5,6,1,3,2],
'rating':[1,2,1,5,4,3,3,2,1,5]
}
df_3 = pd.DataFrame(dd3)
df_pivot = pd.pivot_table(df_3,index='user',columns='movie',values='rating').fillna(0)
print(df_pivot)
print('-'*80)
df_4 = df_pivot.iloc[:,[1]]
df_5 = df_pivot.iloc[:,[2]]
print('output of corrwith = \n', df_4.corrwith(df_5,axis=0))
print('df_4.shape = ', df_4.shape)
print('df_5.shape = ', df_5.shape)
print('-'*80)
The output is
output of corrwith =
movie
2 NaN
3 NaN
dtype: float64
df_4.shape = (6, 1)
df_5.shape = (6, 1)
So, my question is: Why does df.corrwith
produce two NaNs
in the second case but only one value output (1.0
) in the first? And why is it producing NaNs
- if I do the correlation manually, it produces 0.2
.
CodePudding user response:
Explanation
The thing going wrong here is that the column names of both dataframes need to be the same. The following does not work (note the change in column name):
aa = np.arange(6,dtype='float64')
bb = np.arange(6,dtype='float64')
dd1 = {'rating':aa}
dd2 = {'ratings':bb}
df_1 = pd.DataFrame(dd1)
df_2 = pd.DataFrame(dd2)
print('output of corrwith = \n', df_1.corrwith(df_2,axis=0))
Output
output of corrwith =
rating NaN
ratings NaN
dtype: float64
Solution
You can use df_pivot.corrwith(df_pivot[2])
to get the correlation between every column and movie 2.
Output:
movie
1 -0.28
2 1.00
3 0.20
4 -0.28
5 -0.28
6 -0.28
dtype: float64