How to predict on a grouped DataFrame, using a dictionary of models, and return to original test DataFrame

I have created a dictionary of regression models, keyed by the values of group in a training dataset, d:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

d = pd.DataFrame({
    "group":["cat","fish","horse","cat","fish","horse","cat","horse"],
    "x":[1,4,7,2,5,8,3,9],
    "y":[10,20,14,12,12,3,12,2],
    "z":[3,5,3,5,9,1,2,3]
})

features = ["x", "z"]
models = {}
for animal in ["horse", "cat", "fish"]:
    # Fit one model per group, using only that group's rows.
    rows = d["group"] == animal
    models[animal] = Pipeline([("estimator", LinearRegression(fit_intercept=True))])
    models[animal].fit(d.loc[rows, features], d.loc[rows, "y"])

I also have a test dataset, test_d. It has rows for some, but not all, of the groups that have a model (fish and horse, but not cat), plus a group with no model at all (dog).

test_d = pd.DataFrame({
    "group":["dog","fish","horse","dog","fish","horse","dog","horse"],
    "x":[1,2,3,4,5,6,7,8],
    "z":[3,5,3,5,9,1,2,3]
})

I want to use apply on the grouped test_d, using each group's .name to look up the correct model (if it exists) and return the predictions via a function f():

def f(g):
    try:
        predictions = models[g.name].predict(g[features])
    except KeyError:
        # No model was trained for this group.
        predictions = [None]*len(g)
    return predictions

The function "works" in the sense that it returns the correct values, but as one list per group:

grouping_column = "group"
test_d.groupby(grouping_column, group_keys=False).apply(f)

Output:

group
dog                           [None, None, None]
fish     [20.94117647058824, 12.000000000000004]
horse                          [38.0, 15.0, 8.0]
dtype: object

Question:

How should f() be written so that I can assign the values directly to test_d? I want to do something like this:

test_d["predictions"] = test_d.groupby(grouping_column, group_keys=False).apply(f)

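But this doesn't work: the result of apply is indexed by group name, not by the rows of test_d, so nothing aligns and every prediction is NaN.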

   group  x  z predictions
0    dog  1  3         NaN
1   fish  2  5         NaN
2  horse  3  3         NaN
3    dog  4  5         NaN
4   fish  5  9         NaN
5  horse  6  1         NaN
6    dog  7  2         NaN
7  horse  8  3         NaN

Expected Output

   group  x  z  predictions
0    dog  1  3          NaN
1   fish  2  5    20.941176
2  horse  3  3    38.000000
3    dog  4  5          NaN
4   fish  5  9    12.000000
5  horse  6  1    15.000000
6    dog  7  2          NaN
7  horse  8  3     8.000000

Answer:

Your function f should return a Series that carries the group's original row index (g.index):

def f(g):
    try:
        predictions = models[g.name].predict(g[features])
    except KeyError:
        # No model was trained for this group.
        predictions = [None]*len(g)
    return pd.Series(predictions, index=g.index)

test_d.groupby('group', group_keys=False).apply(f)

Output:

0         None
3         None
6         None
1    20.941176
4         12.0
2         38.0
5         15.0
7          8.0
dtype: object
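
The index matters because pandas aligns a Series on its index labels, not on position, when it is assigned as a column. A minimal sketch of that behaviour, using a made-up toy frame (the names df, by_group and by_row and all values are illustrative, not taken from the question):

```python
import pandas as pd

# Toy frame with the default integer index 0, 1, 2.
df = pd.DataFrame({"group": ["dog", "fish", "dog"], "x": [1, 2, 4]})

# Indexed by group label: none of these labels exists in df.index,
# so alignment finds no match and every row becomes NaN.
by_group = pd.Series([10.0, 20.0], index=["dog", "fish"])
df["misaligned"] = by_group

# Indexed by the original row labels, deliberately out of order:
# each value still lands on the row whose label it carries.
by_row = pd.Series([10.0, 40.0, 20.0], index=[0, 2, 1])
df["aligned"] = by_row

print(df)
```

The first assignment mirrors the all-NaN column from the question; the second mirrors a result whose rows come back grouped rather than in their original order.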

So when you assign the result, its index aligns with that of test_d:

test_d['predictions'] = test_d.groupby('group', group_keys=False).apply(f)

Output:

   group  x  z predictions
0    dog  1  3        None
1   fish  2  5   20.941176
2  horse  3  3        38.0
3    dog  4  5        None
4   fish  5  9        12.0
5  horse  6  1        15.0
6    dog  7  2        None
7  horse  8  3         8.0