How to get unique pair from nested loop in Python-CodePudding

I am trying to find correlations between dataframe columns using nested loop.

import itertools
for col1 in df.columns:
    for col2 in df.columns:
        if col1!=col2 and col1 not in (["Country","Status"]) and col2 not in (["Country","Status"]):
            correlation=round(df[col1].corr(df[col2]),2)
            if correlation<-0.5 or correlation>0.5:
                print(col1,col2,correlation)

This is the output for this code:

As you can see some results are the same. For example: LifeExpectancy BMI 0.57 and BMI LifeExpectancy 0.57

My question is how can I get only the unique results. I have heard about itertools combination method, but not sure how to use it in this case.

CodePudding user response：

Combinations appear twice because when doing the cartesian product you get all the ordered couples. In this case the order does not matter, so you can remove them.

In python sets are unordered collections, so you can add the results to a set, and then print the set.

x = set()
x.add(("a",1))
x.add(("b",2))
x.add(("a",1))
print(x) --> {('a', 1), ('b', 2)}

With this in mind, you can modify your code like this:

x = set()
for col1 in df.columns:
    for col2 in df.columns:
        if col1!=col2 and col1 not in (["Country","Status"]) and col2 not in (["Country","Status"]):
            correlation=round(df[col1].corr(df[col2]),2)
            if correlation<-0.5 or correlation>0.5:
                x.add((col1,col2,correlation))

print(x)

There could be a problem when hashing (this is what is done to determine if an element is already present in the set) a floating point, but since you are rounding already, this problem is minimal.

This also helps you find anomalies in your data because if you find a double entry it means that corr(x,y) != corr(y,x) which should not happen.

CodePudding user response：

Modify your loops: in the outer loop, iterate on each column but the last one; in the inner loop, iterate on each column past the current col1. Something like:

for i,col1 in enumerate(df.columns[:-1]):
    for col2 in df.columns[i 1:]:

In this way you will speed up your code by a 2x factor. Also, you can avoid checking if col1 and col2 are equal.