Let's say I have a dataframe like this
import pandas as pd
from scipy import stats

df = pd.DataFrame(
    {
        'group': list('abaab'),
        'val1': range(5),
        'val2': range(2, 7),
        'val3': range(4, 9)
    }
)
  group  val1  val2  val3
0     a     0     2     4
1     b     1     3     5
2     a     2     4     6
3     a     3     5     7
4     b     4     6     8
Now I want to calculate a linear regression for each group in column group using two of the vali columns (potentially all pairs, so I don't want to hardcode column names anywhere).
A potential implementation based on pipe could look like this:
def do_lin_reg_pipe(df, group_col, col1, col2):
    group_names = df[group_col].unique()
    df_subsets = []
    for s in group_names:
        df_subset = df.loc[df[group_col] == s]
        x = df_subset[col1].values
        y = df_subset[col2].values
        slope, intercept, r, p, se = stats.linregress(x, y)
        df_subset = df_subset.assign(
            slope=slope,
            intercept=intercept,
            r=r,
            p=p,
            se=se
        )
        df_subsets.append(df_subset)
    return pd.concat(df_subsets)
and then I can use
df_linreg_pipe = (
    df.pipe(do_lin_reg_pipe, group_col='group', col1='val1', col2='val3')
    .assign(p=lambda d: d['p'].round(3))
)
which gives the desired outcome
  group  val1  val2  val3  slope  intercept    r    p   se
0     a     0     2     4    1.0        4.0  1.0  0.0  0.0
2     a     2     4     6    1.0        4.0  1.0  0.0  0.0
3     a     3     5     7    1.0        4.0  1.0  0.0  0.0
1     b     1     3     5    1.0        4.0  1.0  0.0  0.0
4     b     4     6     8    1.0        4.0  1.0  0.0  0.0
What I don't like is that I have to loop through the groups, collect the subsets with append, and then also concat them, so I thought I should somehow use groupby and transform, but I can't get this to work. The function call should be something like:
df_linreg_transform = df.copy()
df_linreg_transform[['slope', 'intercept', 'r', 'p', 'se']] = (
    df.groupby('group').transform(do_lin_reg_transform, col1='val1', col2='val3')
)
The question is how to define do_lin_reg_transform; I would like to have something along these lines:
def do_lin_reg_transform(df, col1, col2):
    x = df[col1].values
    y = df[col2].values
    slope, intercept, r, p, se = stats.linregress(x, y)
    return (slope, intercept, r, p, se)
but that - of course - crashes with a KeyError
KeyError: 'val1'
How could one implement do_lin_reg_transform to make it work with groupby and transform?
CodePudding user response:
As you can't use groupby_transform, because you need extra columns to compute the result, the idea is to use groupby_apply with map to broadcast the result to each row:
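cols = ['slope', 'intercept', 'r', 'p', 'se']
# Fit one regression per group; each result unpacks into the five values in cols
lingress = lambda x: stats.linregress(x['val1'], x['val3'])
# Map every row's group label to its group's result, then expand it into columns
df[cols] = pd.DataFrame.from_records(df['group'].map(df.groupby('group').apply(lingress)))
print(df)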
# Output
  group  val1  val2  val3  slope  intercept    r             p   se
0     a     0     2     4    1.0        4.0  1.0  9.003163e-11  0.0
1     b     1     3     5    1.0        4.0  1.0  0.000000e+00  0.0
2     a     2     4     6    1.0        4.0  1.0  9.003163e-11  0.0
3     a     3     5     7    1.0        4.0  1.0  9.003163e-11  0.0
4     b     4     6     8    1.0        4.0  1.0  0.000000e+00  0.0
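Since the question asks not to hardcode column names, here is a minimal sketch of the same broadcast idea with the names passed as parameters. The helper names group_linreg and add_group_linreg are made up for this illustration; the sketch selects the two value columns before calling apply and uses join on the group column instead of map, so it does not rely on the frame having a default integer index.

```python
import pandas as pd
from scipy import stats

COLS = ['slope', 'intercept', 'r', 'p', 'se']

df = pd.DataFrame({
    'group': list('abaab'),
    'val1': range(5),
    'val2': range(2, 7),
    'val3': range(4, 9),
})

def group_linreg(df, group_col, col1, col2):
    """One row of regression statistics (col2 against col1) per group."""
    def fit(gp):
        res = stats.linregress(gp[col1], gp[col2])
        return pd.Series(
            [res.slope, res.intercept, res.rvalue, res.pvalue, res.stderr],
            index=COLS,
        )
    # Selecting the two value columns first means fit never sees the group column
    return df.groupby(group_col)[[col1, col2]].apply(fit)

def add_group_linreg(df, group_col, col1, col2):
    """Broadcast the per-group statistics onto every row of df."""
    # join(on=...) matches df[group_col] against the index of the statistics table
    return df.join(group_linreg(df, group_col, col1, col2), on=group_col)

out = add_group_linreg(df, 'group', 'val1', 'val3')
print(out)
```

The original row order and index are kept, because join only adds columns to the left-hand frame.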
CodePudding user response:
transform is meant to operate on a single column at a time. A regression requires several columns, so you should use apply.
If you wanted, you could define your function to return a DataFrame as opposed to a Series (so the result doesn't reduce). For this to work, you'd want to make sure your index is unique. Then concat the result back so it aligns on the index. You won't have any issues if there's more than one grouping column.
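To see why the attempt in the question fails, a small check can record what transform actually hands to the function. This is only a sketch on a reduced copy of the example frame; record is a made-up helper, and depending on the pandas version the function may additionally be tried on the whole sub-frame internally.

```python
import pandas as pd

df = pd.DataFrame({
    'group': list('abaab'),
    'val1': range(5),
    'val3': range(4, 9),
})

seen = []  # type names of the objects transform passes to the function

def record(chunk):
    seen.append(type(chunk).__name__)
    return chunk  # return the input unchanged so transform succeeds

df.groupby('group').transform(record)
print(seen)  # contains 'Series': single columns are passed in

# Looking up a column name on such a single column fails,
# which is the KeyError from the question.
try:
    df['val1']['val1']
    raised = False
except KeyError:
    raised = True
print(raised)
```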
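def group_reg(gp, col1, col2):
    # Repeat the group's regression result once per row, keeping the original index
    df = pd.DataFrame([stats.linregress(gp[col1], gp[col2])]*len(gp),
                      columns=['slope', 'intercept', 'r', 'p', 'se'],
                      index=gp.index)
    return df

# group_keys=False keeps the original index on the result so concat can align on it
pd.concat([df, df.groupby('group', group_keys=False).apply(group_reg, col1='val1', col2='val3')], axis=1)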
  group  val1  val2  val3  slope  intercept    r             p   se
0     a     0     2     4    1.0        4.0  1.0  9.003163e-11  0.0
1     b     1     3     5    1.0        4.0  1.0  0.000000e+00  0.0
2     a     2     4     6    1.0        4.0  1.0  9.003163e-11  0.0
3     a     3     5     7    1.0        4.0  1.0  9.003163e-11  0.0
4     b     4     6     8    1.0        4.0  1.0  0.000000e+00  0.0
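The question also mentions running this for potentially all pairs of value columns. One possible way to do that, sketched here under the assumption that a summary table with one row per group and column pair is acceptable (rather than columns broadcast onto every row), is to combine the groupby with itertools.combinations. The function name pairwise_group_linreg and the 'x' and 'y' column labels are made up for this example.

```python
from itertools import combinations

import pandas as pd
from scipy import stats

df = pd.DataFrame({
    'group': list('abaab'),
    'val1': range(5),
    'val2': range(2, 7),
    'val3': range(4, 9),
})

def pairwise_group_linreg(df, group_col, value_cols):
    """Regress every pair of value_cols within every group of group_col."""
    rows = []
    for key, gp in df.groupby(group_col):
        for x_col, y_col in combinations(value_cols, 2):
            res = stats.linregress(gp[x_col], gp[y_col])
            rows.append({
                group_col: key, 'x': x_col, 'y': y_col,
                'slope': res.slope, 'intercept': res.intercept,
                'r': res.rvalue, 'p': res.pvalue, 'se': res.stderr,
            })
    return pd.DataFrame(rows)

summary = pairwise_group_linreg(df, 'group', ['val1', 'val2', 'val3'])
print(summary)
```

A table like this can be filtered down to one pair and merged back onto the original frame by the group column if per-row values are needed.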