How to return most correlated features in three columns in Pandas?

Suppose I have a data frame like this:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size = (5, 5)), columns = list('abcde'), index = list('abcde'))
print(df.corr())

          a         b         c         d         e
a  1.000000  0.598238 -0.623532 -0.187738  0.284429
b  0.598238  1.000000 -0.822820 -0.524259 -0.562846
c -0.623532 -0.822820  1.000000  0.097568  0.181560
d -0.187738 -0.524259  0.097568  1.000000  0.602838
e  0.284429 -0.562846  0.181560  0.602838  1.000000

Now I want a function that can give an output like:

top_corr(a, 3)

Features    Top Similar Features    Correlation
       a                       d           0.98
       a                       b           0.88
       a                       c           0.59

How can I do it?

CodePudding user response:

This should work, although edge cases (for example, n larger than the number of other columns) are not handled:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(5, 5)), columns=list('abcde'), index=list('abcde'))


def top_corr(data: pd.DataFrame, column: str, n: int):
    corr_matrix = data.corr()
    # Sort descending; position 0 is the column itself (correlation 1.0), so skip it
    corr_sorted = corr_matrix[column].sort_values(key=lambda x: -x)[1:n + 1]
    return pd.DataFrame({
        'Features': [column for _ in range(n)],
        'Top Similar Features': corr_sorted.index,
        'Correlation': corr_sorted.values
    })


result = top_corr(df, 'a', 3)
print(result)

Result:

  Features Top Similar Features  Correlation
0        a                    b     0.974219
1        a                    c     0.311234
2        a                    d    -0.075999
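
A few of the unhandled edge cases could be covered along these lines. This is only a sketch: the name top_corr_safe, the choice to raise KeyError for an unknown column, and the decision to discard NaN correlations are assumptions, not part of the answer above. It drops the column by label instead of slicing by position, and returns fewer rows when n exceeds the number of remaining features.

```python
import numpy as np
import pandas as pd


def top_corr_safe(data: pd.DataFrame, column: str, n: int) -> pd.DataFrame:
    """Hypothetical variant of top_corr that guards a few edge cases."""
    if column not in data.columns:
        raise KeyError(f"{column!r} is not a column of the data frame")
    # Remove the column itself by label rather than by position, so a tie
    # at correlation 1.0 cannot leave it in the result.
    corr = data.corr()[column].drop(column)
    # Discard NaN correlations (e.g. from constant columns), then keep the
    # n largest; head(n) simply returns fewer rows if n is too large.
    corr = corr.dropna().sort_values(ascending=False).head(n)
    return pd.DataFrame({
        "Features": column,  # scalar is broadcast to every row
        "Top Similar Features": corr.index,
        "Correlation": corr.to_numpy(),
    })


rng = np.random.default_rng(0)  # seed chosen arbitrarily, for reproducibility
df = pd.DataFrame(rng.integers(0, 100, size=(5, 5)), columns=list("abcde"))
print(top_corr_safe(df, "a", 10))  # n exceeds the 4 other features
```

Note that this ranks by signed correlation, as the original does; ranking by absolute value would be a different design choice.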

CodePudding user response:

(df.corr()
   .loc[var, df.columns.difference([var])]
   .nlargest(n))

Select the row for the variable and every column except that variable itself, then take the n largest values with nlargest. This gets us most of the way there:

In [66]: var = "a"

In [67]: n = 3

In [68]: (df.corr()
    ...:    .loc[var, df.columns.difference([var])]
    ...:    .nlargest(n))
Out[68]:
e    0.779293
b    0.046749
c   -0.499404
Name: a, dtype: float64

The rest is mostly aesthetics:

In [70]: (df.corr()
    ...:    .loc[var, df.columns.difference([var])]
    ...:    .nlargest(n)
    ...:    .reset_index(name="correlation")
    ...:    .rename(columns={"index": f"top {n} similar feats"})
    ...:    .assign(feature=var))
Out[70]:
  top 3 similar feats  correlation feature
0                   e     0.779293       a
1                   b     0.046749       a
2                   c    -0.499404       a

Converting this to a function of var, n is an exercise for the reader :)

(Because no random seed is set, the numbers here differ from those in the question.)
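
For reference, one way the chain might be wrapped into a function of var and n is sketched below. The function name, the column labels (taken from the question's desired output), and the fixed seed are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd


def top_corr(df: pd.DataFrame, var: str, n: int) -> pd.DataFrame:
    """Return the n columns most correlated with var, one per row."""
    return (df.corr()
              .loc[var, df.columns.difference([var])]  # row of var, minus var itself
              .nlargest(n)
              .rename_axis("Top Similar Features")     # name the index before resetting it
              .reset_index(name="Correlation")
              .assign(Features=var)
              .loc[:, ["Features", "Top Similar Features", "Correlation"]])


np.random.seed(0)  # arbitrary seed so repeated runs give the same frame
df = pd.DataFrame(np.random.randint(0, 100, size=(5, 5)), columns=list("abcde"))
print(top_corr(df, "a", 3))
```

Naming the index with rename_axis before reset_index avoids the separate rename step on the "index" column.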