Calculating Pearson correlation using Python loops

I am trying to compute the Pearson correlation with Python loops, grouping by the "Server" field.

The logic is as follows. The first loop iterates over each host. The second loop iterates over each signal of that host. The third loop correlates that signal with the same signal on every other host; if the correlation is > 0.6, the relationship count between those two hosts (the host from the first loop and the host from the third loop) is incremented by 1.

I have a data.csv file like this:

Server   Signal1    Signal2
Host1     83.73    56.87
Host1     55.32    74.24
Host1     76.52    85.20
Host2     7.02     10.25
Host2     52.52    74.25
Host2     44.52    15.20
Host3     45.26    12.85
Host3     25.65    74.20
Host3     49.36    89.20

import pandas as pd
df=pd.read_csv("data.csv")

Server = df['Server'].tolist()
Signal1= df['Signal1'].tolist()
Signal2= df['Signal2'].tolist()
for device in Device:
  for signal in Signal1:
    if Device in Signal1:
       corr, _ = pearsonr(device,signal)
       print('Pearsons correlation: %.3f' % corr)

I tried to build this logic, but the code does not work: I am not able to calculate the Pearson correlation inside the for loop and check the "> 0.6" condition.

CodePudding user response：

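# hosts has the shape {'hostname': {'signal1': [values], 'signal2': [values]}, ...}
hosts = {
    'Host1': {'signal1': [83.73, 55.32, 76.52], 'signal2': [56.87, 74.24, 85.20]},
    'Host2': {'signal1': [7.02, 52.52, 44.52], 'signal2': [10.25, 74.25, 15.20]},
    'Host3': {'signal1': [45.26, 25.65, 49.36], 'signal2': [12.85, 74.20, 89.20]},
}

def calculate_correlation(a, b):
    # Plain-Python Pearson correlation coefficient (one possible implementation)
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    return cov / (sum((p - ma) ** 2 for p in a) * sum((q - mb) ** 2 for q in b)) ** 0.5

arr = list(hosts.keys())  # list() so the keys can be indexed in Python 3
correlations = {host: {} for host in arr}
for i in range(len(arr)):
    for j in range(i + 1, len(arr)):  # visit each pair of hosts once
        x = arr[i]
        y = arr[j]
        corr = calculate_correlation(hosts[x]['signal1'], hosts[y]['signal1'])
        # Put extra conditions here (e.g. corr > 0.6). For now the result is just
        # saved in the correlations dict; the same can be done for signal2.
        correlations[x][y] = corr
        correlations[y][x] = corr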

NumPy also provides a function for calculating the Pearson correlation coefficient (numpy.corrcoef), in case you want to avoid writing it yourself.
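
To make the NumPy remark concrete, here is a minimal sketch, assuming numpy.corrcoef as the correlation routine and the sample data from the question. The names hosts, relationships, and THRESHOLD are illustrative and not part of the original answer. It applies the > 0.6 rule to every signal and counts how many signals link each pair of hosts.

```python
from itertools import combinations

import numpy as np

# Sample data from the question, grouped per host (illustrative layout).
hosts = {
    "Host1": {"Signal1": [83.73, 55.32, 76.52], "Signal2": [56.87, 74.24, 85.20]},
    "Host2": {"Signal1": [7.02, 52.52, 44.52], "Signal2": [10.25, 74.25, 15.20]},
    "Host3": {"Signal1": [45.26, 25.65, 49.36], "Signal2": [12.85, 74.20, 89.20]},
}
THRESHOLD = 0.6  # cut-off taken from the question

relationships = {}
for a, b in combinations(hosts, 2):  # each unordered pair of hosts once
    count = 0
    for signal in hosts[a]:  # same signal name on both hosts
        # corrcoef returns a 2x2 matrix; [0, 1] is the coefficient between the two inputs
        r = np.corrcoef(hosts[a][signal], hosts[b][signal])[0, 1]
        if r > THRESHOLD:
            count += 1
    relationships[(a, b)] = count

print(relationships)
```

With the sample data, only the Host1/Host3 pair exceeds the threshold, and it does so on both signals. With just three samples per host these coefficients are not statistically meaningful, so treat the result as a demonstration of the loop structure only.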

CodePudding user response：

Why not use groupby with corr:

# Setup
import pandas as pd

data = {'Server': ['Host1', 'Host1', 'Host1', 'Host2', 'Host2',
                   'Host2', 'Host3', 'Host3', 'Host3'],
        'Signal1': [83.73, 55.32, 76.52, 7.02, 52.52, 44.52, 45.26, 25.65, 49.36],
        'Signal2': [56.87, 74.24, 85.2, 10.25, 74.25, 15.2, 12.85, 74.2, 89.2]}
df = pd.DataFrame(data)

# Correlation
out = df.groupby('Server').corr(method='pearson')
print(out)

# Output
                 Signal1   Signal2
Server                            
Host1  Signal1  1.000000 -0.367667
       Signal2 -0.367667  1.000000
Host2  Signal1  1.000000  0.687893
       Signal2  0.687893  1.000000
Host3  Signal1  1.000000 -0.173744
       Signal2 -0.173744  1.000000
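
Note that the groupby result above correlates Signal1 with Signal2 within each host. If the goal is instead to compare the same signal across hosts, as the question describes, one option is to reshape the data so that each host becomes a column. The following is a sketch under that assumption. It also assumes that rows are aligned by their order within each host; the sample column and the names wide and host_corr are helpers introduced here.

```python
import pandas as pd

data = {'Server': ['Host1', 'Host1', 'Host1', 'Host2', 'Host2',
                   'Host2', 'Host3', 'Host3', 'Host3'],
        'Signal1': [83.73, 55.32, 76.52, 7.02, 52.52, 44.52, 45.26, 25.65, 49.36],
        'Signal2': [56.87, 74.24, 85.2, 10.25, 74.25, 15.2, 12.85, 74.2, 89.2]}
df = pd.DataFrame(data)

# Number the rows within each host so samples can be aligned across hosts.
df['sample'] = df.groupby('Server').cumcount()

# One column per host, one row per sample position (Signal1 only).
wide = df.pivot(index='sample', columns='Server', values='Signal1')

# Pairwise Pearson correlation between hosts for Signal1.
host_corr = wide.corr(method='pearson')
print(host_corr.round(3))

# Boolean mask of host pairs above the 0.6 threshold
# (the diagonal is always True because each host correlates perfectly with itself).
print(host_corr > 0.6)
```

Repeating the pivot with values='Signal2' gives the matching matrix for the second signal, and summing the two boolean masks (ignoring the diagonal) yields the relationship count per pair of hosts.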