Given the table below:
X Y pr
0 1 0.30
0 2 0.25
1 1 0.15
1 2 0.30
I want to create a custom function that calculates the covariance between X and Y, as well as their variances.
I need to find the mean of both X and Y, subtract each mean from the corresponding values, and then multiply the resulting deviations together.
Here is my not-so-good code:
def cor(data_frame):
    data_frame[['X']].mean()
    data_frame[['Y']].mean()
    cov = pd.merge(distr_table.groupby('X', as_index=False)['pr'].mean(), distr_table.groupby('Y', as_index=False)['pr'].mean(), how='cross')
I need to find a way to iterate and loop through. Thanks
CodePudding user response:
You don't need a loop. You can apply the definition of covariance directly, using pr as the probability weights:
((data_frame['X']-(data_frame['X']*data_frame['pr']).sum())*
(data_frame['Y']-(data_frame['Y']*data_frame['pr']).sum())*
data_frame['pr']).sum()
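As a minimal sketch, here is that expression applied to the table from the question. The DataFrame construction and the helper name weighted_cov are assumptions for illustration, and the sketch assumes the pr column sums to 1:

```python
import pandas as pd

# Table from the question: joint probabilities pr for each (X, Y) pair.
data_frame = pd.DataFrame({
    "X": [0, 0, 1, 1],
    "Y": [1, 2, 1, 2],
    "pr": [0.30, 0.25, 0.15, 0.30],
})

def weighted_cov(df, a, b, w="pr"):
    """Probability-weighted covariance of columns a and b (hypothetical helper)."""
    mean_a = (df[a] * df[w]).sum()  # E[a]
    mean_b = (df[b] * df[w]).sum()  # E[b]
    # E[(a - E[a]) * (b - E[b])]
    return ((df[a] - mean_a) * (df[b] - mean_b) * df[w]).sum()

cov_xy = weighted_cov(data_frame, "X", "Y")
var_x = weighted_cov(data_frame, "X", "X")  # variance is the covariance of X with itself
var_y = weighted_cov(data_frame, "Y", "Y")
print(cov_xy, var_x, var_y)  # approximately 0.0525, 0.2475, 0.2475
```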
However, I would strongly recommend not reimplementing it, as the function already exists in the NumPy library:
covariance_matrix = np.cov(data_frame['X'],data_frame['Y'],ddof=0,aweights=data_frame['pr'])
This returns the whole covariance matrix; you can access the covariance of X and Y using:
covariance_matrix[0,1]
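A sketch of that call on the question's table, assuming the DataFrame is built as shown here (the construction itself is not part of the original answer). The diagonal entries of the matrix are the variances of X and Y:

```python
import numpy as np
import pandas as pd

# Table from the question.
data_frame = pd.DataFrame({
    "X": [0, 0, 1, 1],
    "Y": [1, 2, 1, 2],
    "pr": [0.30, 0.25, 0.15, 0.30],
})

# aweights treats pr as observation weights; ddof=0 gives the population form.
covariance_matrix = np.cov(
    data_frame["X"].to_numpy(dtype=float),
    data_frame["Y"].to_numpy(dtype=float),
    ddof=0,
    aweights=data_frame["pr"].to_numpy(dtype=float),
)

var_x = covariance_matrix[0, 0]   # variance of X, about 0.2475
var_y = covariance_matrix[1, 1]   # variance of Y, about 0.2475
cov_xy = covariance_matrix[0, 1]  # covariance of X and Y, about 0.0525
print(covariance_matrix)
```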
CodePudding user response:
Guys, please don't kick my ass if the code is lengthy and boring; I am a newbie. Please do comment, as that will help me learn more.
Here is a custom function that returns all the intermediate values along with the covariance and the correlation coefficient. It uses only the X and Y columns of the DataFrame, so every row is weighted equally and pr is ignored.
import numpy as np

def cov_corr(df):
    # Unweighted mean of each column
    mean_x = df['X'].sum()/len(df)
    mean_y = df['Y'].sum()/len(df)
    # Deviation of each value from its column mean
    sub_x = [df['X'][i] - mean_x for i in df.index]
    sub_y = [df['Y'][i] - mean_y for i in df.index]
    # Population variances of X and Y (named cov_x and cov_y here)
    cov_x = sum([i**2 for i in sub_x])/len(df)
    cov_y = sum([i**2 for i in sub_y])/len(df)
    # Pair the deviations by position so a non-default index also works
    cov_xy = sum([dx*dy for dx, dy in zip(sub_x, sub_y)])/len(df)
    corr_xy = round(cov_xy/(np.sqrt(cov_x) * np.sqrt(cov_y)), 2)
    keys = ['mean_x', 'mean_y', 'sub_x', 'sub_y', 'cov_x', 'cov_y', 'cov_xy', 'corr_xy']
    all_vals = dict(zip(keys, [mean_x, mean_y, sub_x, sub_y, cov_x, cov_y, cov_xy, corr_xy]))
    return all_vals
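For comparison, here is a shorter sketch that computes the same equal-weight covariance and correlation with NumPy. The X and Y values come from the question's table; the variable names are assumptions:

```python
import numpy as np
import pandas as pd

# Only the X and Y columns of the question's table; pr is left out on purpose.
df = pd.DataFrame({"X": [0, 0, 1, 1], "Y": [1, 2, 1, 2]})

x = df["X"].to_numpy(dtype=float)
y = df["Y"].to_numpy(dtype=float)

# ddof=0 divides by len(df), i.e. the population form.
cov_matrix = np.cov(x, y, ddof=0)
cov_xy = cov_matrix[0, 1]
corr_xy = np.corrcoef(x, y)[0, 1]
print(cov_xy, corr_xy)
```

With equal weights the covariance for this table comes out as 0, whereas weighting the rows by pr gives about 0.0525, so whether pr is used changes the result.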