I am trying to plot a correlation heatmap and I noticed that some of the values are wrong.
Below is my heatmap. As you can see, the numbers for the action column are not appearing.
This is my dataframe:
all_gen_cols = steamUniqueTitleGenre[['action', 'adventure','casual', 'indie','massively_multiplayer','rpg','racing','simulation','sports','strategy']]
   action  adventure  casual  indie  massively_multiplayer  rpg  racing  simulation  sports  strategy
0       1          0       0      0                      0    0       0           0       0         0
1       1          1       0      0                      1    0       0           0       0         0
2       1          1       0      0                      0    0       0           0       0         1
3       1          1       0      0                      1    0       0           0       0         0
4       1          0       0      0                      1    1       0           0       0         1
This is the code to produce the heatmap:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb

def plot_correlation_heatmap(df):
    corr = df.corr()
    sb.set(style='white')
    # Mask the upper triangle (diagonal included) so each pair is shown once.
    # Plain bool is used here; the np.bool alias no longer exists in recent NumPy.
    mask = np.zeros_like(corr, dtype=bool)
    mask[np.triu_indices_from(mask)] = True
    f, ax = plt.subplots(figsize=(11, 9))
    cmap = sb.diverging_palette(220, 10, as_cmap=True)
    sb.heatmap(corr, mask=mask, cmap=cmap, vmax=0.3, center=0,
               square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=True)
    plt.yticks(rotation=0)
    plt.show()
    plt.rcdefaults()

plot_correlation_heatmap(all_gen_cols)
I am not sure what the error is.
print(all_gen_cols.corr())
The resulting correlation matrix is below. I see NaN for action, but I am not sure why it is NaN.
                       action  adventure     casual      indie  massively_multiplayer        rpg     racing  simulation     sports   strategy
action                    NaN        NaN        NaN        NaN                    NaN        NaN        NaN         NaN        NaN        NaN
adventure                 NaN   1.000000   0.007138   0.135392               0.023964   0.239136  -0.039846    0.036345  -0.064489   0.001435
casual                    NaN   0.007138   1.000000   0.235474               0.003487  -0.057726   0.079943    0.161448   0.149549   0.084417
indie                     NaN   0.135392   0.235474   1.000000              -0.082661   0.023372   0.045006    0.064723   0.056297   0.076749
massively_multiplayer     NaN   0.023964   0.003487  -0.082661               1.000000   0.160078   0.036685    0.139929   0.018444   0.074683
rpg                       NaN   0.239136  -0.057726   0.023372               0.160078   1.000000  -0.046970    0.044506  -0.051714   0.097123
racing                    NaN  -0.039846   0.079943   0.045006               0.036685  -0.046970   1.000000    0.127511   0.308864  -0.012170
simulation                NaN   0.036345   0.161448   0.064723               0.139929   0.044506   0.127511    1.000000   0.212622   0.208754
sports                    NaN  -0.064489   0.149549   0.056297               0.018444  -0.051714   0.308864    0.212622   1.000000   0.020048
strategy                  NaN   0.001435   0.084417   0.076749               0.074683   0.097123  -0.012170    0.208754   0.020048   1.000000
Below is the output of print(all_gen_cols.describe()):
        action     adventure        casual         indie  massively_multiplayer           rpg        racing    simulation        sports      strategy
count  14570.0  14570.000000  14570.000000  14570.000000           14570.000000  14570.000000  14570.000000  14570.000000  14570.000000  14570.000000
mean       1.0      0.362663      0.232189      0.657241               0.050927      0.165202      0.040288      0.121826      0.044269      0.127111
std        0.0      0.480785      0.422244      0.474648               0.219855      0.371376      0.196641      0.327096      0.205699      0.333108
min        1.0      0.000000      0.000000      0.000000               0.000000      0.000000      0.000000      0.000000      0.000000      0.000000
25%        1.0      0.000000      0.000000      0.000000               0.000000      0.000000      0.000000      0.000000      0.000000      0.000000
50%        1.0      0.000000      0.000000      1.000000               0.000000      0.000000      0.000000      0.000000      0.000000      0.000000
75%        1.0      1.000000      0.000000      1.000000               0.000000      0.000000      0.000000      0.000000      0.000000      0.000000
max        1.0      1.000000      1.000000      1.000000               1.000000      1.000000      1.000000      1.000000      1.000000      1.000000
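Since action = [1, 1, ..., 1], var(action) = 0 (your describe() output shows std = 0.0 for that column). The Pearson correlation is rho(X, Y) = cov(X, Y) / (std(X) * std(Y)), so the denominator of rho(action, Y), where Y is any other column, is zero, and rho(action, Y) is undefined (NaN).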
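To make the zero-denominator argument concrete, here is a minimal sketch that computes the Pearson correlation by hand on the five rows shown in the question. The helper name pearson is made up for this illustration, and returning NaN for a zero denominator is a choice that mirrors what pandas reports, not pandas' actual implementation.

```python
import math

def pearson(x, y):
    """Hand-rolled Pearson correlation; NaN when either input is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # The 1/n factors cancel between numerator and denominator, so they are omitted.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    denom = math.sqrt(var_x * var_y)
    # A constant column has zero variance, so the denominator is zero
    # and the correlation is undefined.
    return cov / denom if denom != 0 else float("nan")

# First five rows of three columns from the question.
action = [1, 1, 1, 1, 1]
adventure = [0, 1, 1, 1, 0]
massively_multiplayer = [0, 1, 0, 1, 1]

print(pearson(action, adventure))                 # undefined: action never varies
print(pearson(adventure, massively_multiplayer))  # well defined: both columns vary
```

The first call hits the zero-variance branch because every deviation of action from its mean is 0, which is the same reason the whole action row and column in your matrix are NaN.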
As suggested by other users, you should drop the 'action' column before computing the correlation matrix, since it doesn't add information.
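One way to do this without hard-coding the column name is to keep only the columns that take more than one distinct value. This is a sketch on a small stand-in frame built from the five rows in the question (the name df is a placeholder for all_gen_cols); on your real data you would pass the filtered frame to plot_correlation_heatmap.

```python
import pandas as pd

# Stand-in for all_gen_cols: a few columns from the rows shown in the question.
df = pd.DataFrame({
    "action": [1, 1, 1, 1, 1],
    "adventure": [0, 1, 1, 1, 0],
    "massively_multiplayer": [0, 1, 0, 1, 1],
    "strategy": [0, 0, 1, 0, 1],
})

# nunique() counts distinct values per column; a constant column has exactly one.
varying = df.loc[:, df.nunique() > 1]

corr = varying.corr()
print(corr)
```

If you only ever need to remove this one column, all_gen_cols.drop(columns='action') does the same job more directly.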