Suppose I have a data frame like this:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 100, size = (5, 5)), columns = list('abcde'), index = list('abcde'))
print(df.corr())
          a         b         c         d         e
a  1.000000  0.598238 -0.623532 -0.187738  0.284429
b  0.598238  1.000000 -0.822820 -0.524259 -0.562846
c -0.623532 -0.822820  1.000000  0.097568  0.181560
d -0.187738 -0.524259  0.097568  1.000000  0.602838
e  0.284429 -0.562846  0.181560  0.602838  1.000000
Now I want a function that can give an output like:
top_corr(a, 3)
Features  Top Similar Features  Correlation
a         d                     0.98
a         b                     0.88
a         c                     0.59
How can I do it?
CodePudding user response:
This should work. Edge cases are not handled though:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(5, 5)), columns=list('abcde'), index=list('abcde'))

def top_corr(data: pd.DataFrame, column: str, n: int):
    corr_matrix = data.corr()
    # Sort descending, then skip the first entry: the column's correlation with itself (1.0)
    corr_sorted = corr_matrix[column].sort_values(ascending=False)[1:n + 1]
    return pd.DataFrame({
        'Features': [column for _ in range(n)],
        'Top Similar Features': corr_sorted.index,
        'Correlation': corr_sorted.values
    })
result = top_corr(df, 'a', 3)
print(result)
Result:
  Features Top Similar Features  Correlation
0        a                    b     0.974219
1        a                    c     0.311234
2        a                    d    -0.075999
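One of the unhandled edge cases: slicing with [1:n + 1] assumes the column's correlation with itself always sorts first, and the result breaks if n exceeds the number of other columns. A minimal sketch that drops the column by label and adds a few checks could look like this. The name top_corr_safe and the specific checks are assumptions made for illustration, not part of the answer above:

```python
import numpy as np
import pandas as pd


def top_corr_safe(data: pd.DataFrame, column: str, n: int) -> pd.DataFrame:
    # Hypothetical helper name; the validation below is an illustrative assumption.
    if n < 0:
        raise ValueError("n must be non-negative")
    corr_matrix = data.corr()
    if column not in corr_matrix.columns:
        raise KeyError(f"{column!r} not found in the correlation matrix")
    # Drop the self-correlation by label, discard undefined (NaN) correlations,
    # then sort descending and keep at most n rows.
    corr_sorted = (
        corr_matrix[column]
        .drop(column)
        .dropna()
        .sort_values(ascending=False)
        .head(n)
    )
    return pd.DataFrame({
        'Features': [column] * len(corr_sorted),
        'Top Similar Features': corr_sorted.index,
        'Correlation': corr_sorted.values,
    })
```

If n is larger than the number of remaining columns, this simply returns fewer rows instead of raising a length mismatch.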
CodePudding user response:
(df.corr()
   .loc[var, df.columns.difference([var])]
   .nlargest(n))
Query the row for the variable and every column except that variable, then take the nlargest with a predefined n. This gets us to a good place:
In [66]: var = "a"
In [67]: n = 3
In [68]: (df.corr()
    ...:    .loc[var, df.columns.difference([var])]
    ...:    .nlargest(n))
Out[68]:
e    0.779293
b    0.046749
c   -0.499404
Name: a, dtype: float64
The rest is mostly aesthetics:
In [70]: (df.corr()
    ...:    .loc[var, df.columns.difference([var])]
    ...:    .nlargest(n)
    ...:    .reset_index(name="correlation")
    ...:    .rename(columns={"index": f"top {n} similar feats"})
    ...:    .assign(feature=var))
Out[70]:
  top 3 similar feats  correlation feature
0                   e     0.779293       a
1                   b     0.046749       a
2                   c    -0.499404       a
Converting this to a function of var and n is an exercise for the reader :)
(Because no random seed was set, the numbers here differ from those in the other outputs.)
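For anyone who would rather skip the exercise, here is one way the chain could be wrapped into a function of var and n. This is only a sketch: the function name top_corr and the seed value 0 are arbitrary choices, and the seed is there just so the random frame is reproducible between runs.

```python
import numpy as np
import pandas as pd


def top_corr(df: pd.DataFrame, var: str, n: int) -> pd.DataFrame:
    # Same chain as in the session above, with var and n as parameters.
    return (df.corr()
              .loc[var, df.columns.difference([var])]
              .nlargest(n)
              .reset_index(name="correlation")
              .rename(columns={"index": f"top {n} similar feats"})
              .assign(feature=var))


# Seed chosen arbitrarily so that repeated runs build the same frame.
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 100, size=(5, 5)),
                  columns=list("abcde"), index=list("abcde"))
print(top_corr(df, "a", 3))
```

Note that the rename step relies on the columns axis being unnamed, so that reset_index produces a column called "index".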