I'm trying to get the correlation between a single column and the rest of the numerical columns of the data frame, but I'm stuck.
I'm trying this:
corr = IM['imdb_score'].corr(IM)
but I get the error "operands could not be broadcast together with shapes", which I assume is because I'm trying to correlate a vector (my imdb_score column) with a data frame of several columns.
I'm a newbie at Python, thanks in advance for your help.
CodePudding user response:
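I think you can just use .corr(), which returns the correlations between all pairs of columns, and then select just the column you are interested in.
So, something like
IM.corr()['imdb_score']
should work. If the frame also contains non-numeric columns, recent pandas versions may need IM.corr(numeric_only=True) instead.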
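A minimal sketch of this approach, using a small made-up frame in place of the asker's IM. Only imdb_score comes from the question; the budget, duration and title columns are hypothetical. Non-numeric columns are filtered out first with select_dtypes, since the question is only about the numerical columns.

```python
import pandas as pd

# Hypothetical stand-in for the asker's data; every column name except
# 'imdb_score' is made up for illustration.
IM = pd.DataFrame({
    "imdb_score": [1.0, 2.0, 3.0, 4.0],
    "budget": [2.0, 4.0, 6.0, 8.0],      # rises with the score
    "duration": [4.0, 3.0, 2.0, 1.0],    # falls as the score rises
    "title": ["a", "b", "c", "d"],       # non-numeric, excluded below
})

# Keep only numeric columns, build the full correlation matrix,
# then pick the column of interest and drop its self-correlation.
numeric = IM.select_dtypes(include="number")
scores = numeric.corr()["imdb_score"].drop("imdb_score")
print(scores)
```

The result is a Series indexed by the remaining numeric column names, so it can be sorted or filtered like any other Series.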
CodePudding user response:
Rather than calculating all correlations and keeping the ones of interest, it can be computationally more efficient to compute the subset of interesting correlations:
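import pandas as pd
# Toy data: three identical columns, so every correlation comes out as 1.0
df = pd.DataFrame()
df['a'] = range(10)
df['b'] = range(10)
df['c'] = range(10)
# Correlate 'a' with each of the other columns, one pair at a time
pd.DataFrame([[c, df['a'].corr(df[c])] for c in df.columns if c != 'a'], columns=['var', 'corr'])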
CodePudding user response:
The most efficient method is to use corrwith.
Example:
df.corrwith(df['A'])
Setup of example data:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(10, size=(5, 5)), columns=list('ABCDE'))
# One possible draw (the values are random, so yours will differ):
#    A  B  C  D  E
# 0  7  2  0  0  0
# 1  4  4  1  7  2
# 2  6  2  0  6  6
# 3  9  8  0  2  1
# 4  6  0  9  7  7
output:
A    1.000000
B    0.526317
C   -0.209734
D   -0.720400
E   -0.326986
dtype: float64
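Because the setup draws random integers, rerunning it will not give the numbers shown. A minimal sketch that rebuilds the frame from the literal values in the setup comments, so the result is reproducible, and then drops the trivial self-correlation of A (the variable name `result` is just for illustration):

```python
import pandas as pd

# Literal copy of the five rows shown in the setup comments
df = pd.DataFrame(
    [
        [7, 2, 0, 0, 0],
        [4, 4, 1, 7, 2],
        [6, 2, 0, 6, 6],
        [9, 8, 0, 2, 1],
        [6, 0, 9, 7, 7],
    ],
    columns=list("ABCDE"),
)

# Pearson correlation of every column with column A, returned as a Series
result = df.corrwith(df["A"])

# A correlates perfectly with itself, so drop that entry
result = result.drop("A")
print(result)
```

For the asker's data the same pattern would be applied to the numeric columns of IM with IM['imdb_score'] as the argument.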