I have a pandas DataFrame df:
import pandas as pd
d = {'era': ["a", "a", "b", "b", "c", "c"], 'feature1': [3, 4, 5, 6, 7, 8], 'feature2': [7, 8, 9, 10, 11, 12], 'target': [1, 2, 3, 4, 5, 6]}
df = pd.DataFrame(data=d)
I want to compute the correlation between each column in feature_cols = ['feature1', 'feature2'] and TARGET_COL = 'target', separately for each era:
feature_cols = ['feature1', 'feature2']
TARGET_COL = 'target'
corrs_split = (
    df
    .groupby("era")
    .apply(lambda d: d[feature_cols].corrwith(d[TARGET_COL]))
)
I've been trying to do this with Polars, but I can't get a Polars DataFrame with a row for each era and a column of correlations for each feature. The most I've managed is a single column containing all the correlations, without era as an index and not broken out by feature.
Answer:
Here's the Polars equivalent of that code. You can do this by combining groupby() and agg(), passing agg() one correlation expression per feature.
import polars as pl
d = {'era': ["a", "a", "b","b","c", "c"], 'feature1': [3, 4, 5, 6, 7, 8], 'feature2': [7, 8, 9, 10, 11, 12], 'target': [1, 2, 3, 4, 5 ,6]}
df = pl.DataFrame(d)
feature_cols = ['feature1', 'feature2']
TARGET_COL = 'target'
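# Build one correlation expression per feature; each result column is named after the feature.
agg_cols = []
for feature_col in feature_cols:
    agg_cols.append(pl.pearson_corr(feature_col, TARGET_COL))
# Newer Polars releases spell these group_by() and pl.corr().
print(df.groupby("era").agg(agg_cols))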
Output:
shape: (3, 3)
┌─────┬──────────┬──────────┐
│ era ┆ feature1 ┆ feature2 │
│ --- ┆ ---      ┆ ---      │
│ str ┆ f64      ┆ f64      │
╞═════╪══════════╪══════════╡
│ a   ┆ 1.0      ┆ 1.0      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ c   ┆ 1.0      ┆ 1.0      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ b   ┆ 1.0      ┆ 1.0      │
└─────┴──────────┴──────────┘
(The row order may differ for you, since groupby() does not preserve the input order by default.)