Get correlation per groupby/apply in Python Polars

I have a pandas DataFrame df:

import pandas as pd

d = {'era': ["a", "a", "b", "b", "c", "c"], 'feature1': [3, 4, 5, 6, 7, 8], 'feature2': [7, 8, 9, 10, 11, 12], 'target': [1, 2, 3, 4, 5, 6]}
df = pd.DataFrame(data=d)

I want to compute the correlation between each column in feature_cols = ['feature1', 'feature2'] and TARGET_COL = 'target', separately for each era:

corrs_split = (
    df
    .groupby("era")
    .apply(lambda d: d[feature_cols].corrwith(d[TARGET_COL]))
)

I've been trying to do this with Polars, but I can't get a Polars DataFrame with one row per era and one column of correlations per feature. The best I've managed is a single column containing all the correlations, without the era as an index and without any separation by feature.

CodePudding user response：

Here's the polars equivalent of that code. You can do this by combining groupby() and agg().

import polars as pl

d = {'era': ["a", "a", "b","b","c", "c"], 'feature1': [3, 4, 5, 6, 7, 8], 'feature2': [7, 8, 9, 10, 11, 12], 'target': [1, 2, 3, 4, 5 ,6]}
df = pl.DataFrame(d)
feature_cols = ['feature1', 'feature2']
TARGET_COL = 'target'

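# Build one correlation expression per feature; each result column takes the feature's name.
agg_cols = []
for feature_col in feature_cols:
    agg_cols += [pl.pearson_corr(feature_col, TARGET_COL)]
# In newer Polars releases these are spelled group_by and pl.corr.
print(df.groupby("era").agg(agg_cols))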

Output:

shape: (3, 3)
┌─────┬──────────┬──────────┐
│ era ┆ feature1 ┆ feature2 │
│ --- ┆ ---      ┆ ---      │
│ str ┆ f64      ┆ f64      │
╞═════╪══════════╪══════════╡
│ a   ┆ 1.0      ┆ 1.0      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ c   ┆ 1.0      ┆ 1.0      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ b   ┆ 1.0      ┆ 1.0      │
└─────┴──────────┴──────────┘

(Order may be different for you.)