Get correlation per groupby/apply in Python Polars

Time:11-28

I have a pandas DataFrame df:

import pandas as pd

d = {'era': ["a", "a", "b", "b", "c", "c"], 'feature1': [3, 4, 5, 6, 7, 8], 'feature2': [7, 8, 9, 10, 11, 12], 'target': [1, 2, 3, 4, 5, 6]}
df = pd.DataFrame(data=d)

I want to compute the correlation between each column in feature_cols = ['feature1', 'feature2'] and TARGET_COL = 'target', separately for each era:

feature_cols = ['feature1', 'feature2']
TARGET_COL = 'target'

# One row per era, one column per feature
corrs_split = (
    df
    .groupby("era")
    .apply(lambda d: d[feature_cols].corrwith(d[TARGET_COL]))
)

I've been trying to do the same thing with Polars, but I can't get a Polars DataFrame with one row per era and one correlation column per feature. The closest I've got is a single column containing all the correlations, with no era key and no way to tell the features apart.

CodePudding user response:

Here's the polars equivalent of that code. You can do this by combining groupby() and agg().

import polars as pl

d = {'era': ["a", "a", "b", "b", "c", "c"], 'feature1': [3, 4, 5, 6, 7, 8], 'feature2': [7, 8, 9, 10, 11, 12], 'target': [1, 2, 3, 4, 5, 6]}
df = pl.DataFrame(d)
feature_cols = ['feature1', 'feature2']
TARGET_COL = 'target'

agg_cols = []
for feature_col in feature_cols:
    agg_cols += [pl.pearson_corr(feature_col, TARGET_COL)]
print(df.groupby("era").agg(agg_cols))

Output:

shape: (3, 3)
┌─────┬──────────┬──────────┐
│ era ┆ feature1 ┆ feature2 │
│ --- ┆ ---      ┆ ---      │
│ str ┆ f64      ┆ f64      │
╞═════╪══════════╪══════════╡
│ a   ┆ 1.0      ┆ 1.0      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ c   ┆ 1.0      ┆ 1.0      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ b   ┆ 1.0      ┆ 1.0      │
└─────┴──────────┴──────────┘

(The row order may be different for you, because groupby() does not preserve the order of the groups by default.)
