How to use groupby transform instead of pipe?

Let's say I have a dataframe like this

import pandas as pd
from scipy import stats

df = pd.DataFrame(
    {
        'group': list('abaab'),
        'val1': range(5),
        'val2': range(2, 7),
        'val3': range(4, 9)
    }
)

  group  val1  val2  val3
0     a     0     2     4
1     b     1     3     5
2     a     2     4     6
3     a     3     5     7
4     b     4     6     8

Now I want to calculate a linear regression for each group in column group, using two of the val columns (val1, val2, val3; potentially all pairs, so I don't want to hardcode column names anywhere).

A possible implementation based on pipe could look like this

def do_lin_reg_pipe(df, group_col, col1, col2):
    group_names = df[group_col].unique()
    df_subsets = []
    for s in group_names:
        df_subset = df.loc[df[group_col] == s]
        x = df_subset[col1].values
        y = df_subset[col2].values
        slope, intercept, r, p, se = stats.linregress(x, y)
        df_subset = df_subset.assign(
            slope=slope,
            intercept=intercept,
            r=r,
            p=p,
            se=se
        )
        df_subsets.append(df_subset)
    return pd.concat(df_subsets)

and then I can use

df_linreg_pipe = (
    df.pipe(do_lin_reg_pipe, group_col='group', col1='val1', col2='val3')
      .assign(p=lambda d: d['p'].round(3))
)

which gives the desired outcome

  group  val1  val2  val3  slope  intercept    r    p   se
0     a     0     2     4    1.0        4.0  1.0  0.0  0.0
2     a     2     4     6    1.0        4.0  1.0  0.0  0.0
3     a     3     5     7    1.0        4.0  1.0  0.0  0.0
1     b     1     3     5    1.0        4.0  1.0  0.0  0.0
4     b     4     6     8    1.0        4.0  1.0  0.0  0.0

What I don't like is that I have to loop through the groups, append each subset to a list, and then also concat, so I thought I should somehow use groupby and transform, but I can't get this to work. The function call should be something like

df_linreg_transform = df.copy()
df_linreg_transform[['slope', 'intercept', 'r', 'p', 'se']] = (
    df.groupby('group').transform(do_lin_reg_transform, col1='val1', col2='val3')
)

The question is how to define do_lin_reg_transform; I would like to have something along these lines

def do_lin_reg_transform(df, col1, col2):
    
    x = df[col1].values
    y = df[col2].values
    slope, intercept, r, p, se = stats.linregress(x, y)

    return (slope, intercept, r, p, se)

but that - of course - crashes with a KeyError

KeyError: 'val1'

How could one implement do_lin_reg_transform to make it work with groupby and transform?

CodePudding user response：

As you can't use groupby_transform because you need several columns to compute the result, the idea is to use groupby_apply to compute one result per group, and then map to broadcast that result to each row:

cols = ['slope', 'intercept', 'r', 'p', 'se']
lingress = lambda x: stats.linregress(x['val1'], x['val3'])

df[cols] = pd.DataFrame.from_records(df['group'].map(df.groupby('group').apply(lingress)))
print(df)

# Output
  group  val1  val2  val3  slope  intercept    r             p   se
0     a     0     2     4    1.0        4.0  1.0  9.003163e-11  0.0
1     b     1     3     5    1.0        4.0  1.0  0.000000e+00  0.0
2     a     2     4     6    1.0        4.0  1.0  9.003163e-11  0.0
3     a     3     5     7    1.0        4.0  1.0  9.003163e-11  0.0
4     b     4     6     8    1.0        4.0  1.0  0.000000e+00  0.0
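
Since the question asks for no hardcoded column names, here is a minimal sketch of the same compute-once-per-group-then-broadcast idea wrapped in a function. The name add_linreg and its parameters are made up for illustration, and it swaps in a dict comprehension over groupby plus DataFrame.join where the snippet above uses apply and map, so treat it as one possible variant rather than the answer's exact method.

```python
import pandas as pd
from scipy import stats

def add_linreg(df, group_col, col1, col2):
    """Append per-group regression statistics of col2 on col1 to every row.

    Hypothetical helper: the name and signature are assumptions for this sketch.
    """
    cols = ['slope', 'intercept', 'r', 'p', 'se']
    # One regression per group; tuple() unpacks the five values that
    # linregress returns (slope, intercept, r, p, standard error).
    per_group = pd.DataFrame.from_dict(
        {name: tuple(stats.linregress(g[col1], g[col2]))
         for name, g in df.groupby(group_col)},
        orient='index',
        columns=cols,
    )
    # join(on=...) looks up each row's group label in per_group's index,
    # so the original row order and index are kept.
    return df.join(per_group, on=group_col)

df = pd.DataFrame({'group': list('abaab'), 'val1': range(5),
                   'val2': range(2, 7), 'val3': range(4, 9)})
out = add_linreg(df, 'group', 'val1', 'val3')
print(out)
```

Any other pair of columns can be passed the same way, e.g. add_linreg(df, 'group', 'val1', 'val2').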

CodePudding user response：

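Transform is meant for operations on a single column: the function is handed each column of a group as its own Series, so it cannot look up val1 and val3 together (hence the KeyError). A regression requires several columns at once, so you should use apply.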
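
To see the difference, here is a small sketch on a cut-down version of the question's frame (only val1 and val3 are kept, which is my simplification). It assumes, as the KeyError in the question suggests, that transform hands the function one column of a group at a time.

```python
import pandas as pd

df = pd.DataFrame({'group': list('abaab'), 'val1': range(5), 'val3': range(4, 9)})

# Works: centering a column needs nothing but that column's own values.
centered = df.groupby('group').transform(lambda col: col - col.mean())

# Fails: the function receives a column whose index holds row labels,
# so looking up another column by name raises KeyError.
try:
    df.groupby('group').transform(lambda col: col['val1'])
    raised = False
except KeyError:
    raised = True

print(centered)
print("KeyError raised:", raised)
```

The first call returns one transformed column per value column, while the second reproduces the error from the question.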

If you wanted, you could define your function to return a DataFrame as opposed to a Series (so the result doesn't reduce to one row per group). For this to work, you'd want to make sure your index is unique. Then concat the result back so it aligns on the index. You won't have any issues if there's more than one grouping column.

def group_reg(gp, col1, col2):
    df = pd.DataFrame([stats.linregress(gp[col1], gp[col2])]*len(gp), 
                      columns=['slope', 'intercept', 'r', 'p', 'se'],
                      index=gp.index)
    return df

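# group_keys=False keeps the original row index on the result (newer pandas
# versions would otherwise prepend the group key), so concat aligns row by row
pd.concat([df, df.groupby('group', group_keys=False).apply(group_reg, col1='val1', col2='val3')], axis=1)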

  group  val1  val2  val3  slope  intercept    r             p   se
0     a     0     2     4    1.0        4.0  1.0  9.003163e-11  0.0
1     b     1     3     5    1.0        4.0  1.0  0.000000e+00  0.0
2     a     2     4     6    1.0        4.0  1.0  9.003163e-11  0.0
3     a     3     5     7    1.0        4.0  1.0  9.003163e-11  0.0
4     b     4     6     8    1.0        4.0  1.0  0.000000e+00  0.0