I am trying to compute the Pearson correlation between hosts, grouped by the "Server" field, using Python loops.
The logic is as follows: the first loop iterates over each host, the second loop iterates over each signal for that host, and the third loop correlates that signal with the same signal on every other host. If the correlation is > 0.6, the relationship count between those two hosts (the host in the first loop and the host in the third loop) should be incremented by 1.
My data.csv file looks like this:
Server Signal1 Signal2
Host1 83.73 56.87
Host1 55.32 74.24
Host1 76.52 85.20
Host2 7.02 10.25
Host2 52.52 74.25
Host2 44.52 15.20
Host3 45.26 12.85
Host3 25.65 74.20
Host3 49.36 89.20
import pandas as pd
df=pd.read_csv("data.csv")
Server = df['Server'].tolist()
Signal1= df['Signal1'].tolist()
Signal2= df['Signal2'].tolist()
for device in Device:
    for signal in Signal1:
        if Device in Signal1:
            corr, _ = pearsonr(device,signal)
            print('Pearsons correlation: %.3f' % corr)
I tried building this logic, but the code does not work: I cannot figure out how to calculate the Pearson correlation inside the for loops and check the "> 0.6" condition.
CodePudding user response:
# hosts maps each hostname to its signals, e.g.
# hosts = {'hostname': {'signal1': [values], 'signal2': [values]}, ....}
correlations = {}
arr = list(hosts)  # dict keys are not indexable, so copy them into a list
for i in arr:
    correlations[i] = {}
for i in range(len(arr)):
    for j in range(i + 1, len(arr)):
        x = arr[i]
        y = arr[j]
        # calculate_correlation is a placeholder for your own Pearson function
        corr = calculate_correlation(hosts[x]['signal1'], hosts[y]['signal1'])
        # Put extra conditions here. For now the result is just saved in the
        # correlations dict; the same can be done for signal2.
        correlations[x][y] = corr
        correlations[y][x] = corr
Also, NumPy provides a function for calculating the Pearson correlation coefficient (numpy.corrcoef), in case you want to avoid writing it on your own.
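For illustration, here is a minimal sketch of the loop described in the question, using numpy.corrcoef for the correlation. The function name count_relationships and its parameters are made up for this example, and it assumes every host has the same number of rows, compared in file order.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def count_relationships(df, signals=("Signal1", "Signal2"), threshold=0.6):
    """Count, per pair of hosts, how many signals correlate above threshold."""
    # Build {host: {signal: values}} from the long-format table.
    hosts = {
        host: {s: group[s].to_numpy() for s in signals}
        for host, group in df.groupby("Server")
    }
    relationships = {host: {} for host in hosts}
    # Visit every unordered pair of hosts exactly once.
    for x, y in combinations(hosts, 2):
        count = 0
        for s in signals:
            # np.corrcoef returns a 2x2 matrix; entry [0, 1] is the
            # Pearson coefficient between the two input arrays.
            corr = np.corrcoef(hosts[x][s], hosts[y][s])[0, 1]
            if corr > threshold:
                count += 1
        relationships[x][y] = count
        relationships[y][x] = count
    return relationships
```

Calling count_relationships(pd.read_csv("data.csv")) would then give a nested dict such as relationships["Host1"]["Host2"], holding the number of signals whose correlation between those two hosts exceeded 0.6.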
CodePudding user response:
Why not use groupby followed by corr:
import pandas as pd

# Setup
data = {'Server': ['Host1', 'Host1', 'Host1', 'Host2', 'Host2',
                   'Host2', 'Host3', 'Host3', 'Host3'],
        'Signal1': [83.73, 55.32, 76.52, 7.02, 52.52, 44.52, 45.26, 25.65, 49.36],
        'Signal2': [56.87, 74.24, 85.2, 10.25, 74.25, 15.2, 12.85, 74.2, 89.2]}
df = pd.DataFrame(data)
# Correlation
out = df.groupby('Server').corr(method='pearson')
print(out)
# Output
                 Signal1   Signal2
Server
Host1  Signal1  1.000000 -0.367667
       Signal2 -0.367667  1.000000
Host2  Signal1  1.000000  0.687893
       Signal2  0.687893  1.000000
Host3  Signal1  1.000000 -0.173744
       Signal2 -0.173744  1.000000
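Note that the groupby output above correlates Signal1 with Signal2 within each host. If the goal is the same signal across different hosts, as the question describes, one possible sketch is to reshape the data so that each host becomes a column and then call corr. The helper name cross_host_corr and the "sample" column are made up for this example, and it assumes the rows of each host line up by position.

```python
import pandas as pd


def cross_host_corr(df, signal):
    """Return the host-by-host Pearson correlation matrix for one signal."""
    # Number the rows within each host (0, 1, 2, ...) so they can be aligned.
    sample = df.groupby("Server").cumcount()
    # Reshape to one column per host and one row per sample position.
    wide = df.assign(sample=sample).pivot(
        index="sample", columns="Server", values=signal
    )
    # DataFrame.corr computes pairwise correlations between the columns.
    return wide.corr(method="pearson")
```

For example, cross_host_corr(df, 'Signal1') > 0.6 gives a boolean host-by-host table marking the pairs that pass the threshold (the diagonal is always True, since each host correlates perfectly with itself).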