I want to fit a linear regression to each subset of my original data, grouped by V1, V2, V3, V4, V5, and V6, and predict values for each subset. Then I want to store the result in a dataframe with the columns V1, V2, V3, V4, V5, V6, time, Predicted value. How can I achieve this efficiently? What I have now gives me an object that is hard to work with further.

import numpy as np
from sklearn.linear_model import LinearRegression

def model(df):
    # Fit speed ~ time on one group, then predict for time = 1..59
    X = df['time'].to_numpy().reshape((-1, 1))
    Y = df['speed'].to_numpy()
    X_new = np.arange(1, 60, 1).reshape((-1, 1))
    return np.squeeze(LinearRegression().fit(X, Y).predict(X_new))

def group_predictions(df):
    return df.groupby(['V1', 'V2', 'V3', 'V4', 'V5', 'V6']).apply(model)

CodePudding user response:
The output of groupby().apply(model) is a Series in which each element is a NumPy array of predictions, so explode() should do the trick. However, time cannot be a column in the output because the dimensions won't match: model() returns only the 59 predicted values (one per entry of X_new), so unless each sub-df also has exactly 59 rows, time cannot be one of the output columns.

def group_predictions(df):
    return (
        df.groupby(['V1', 'V2', 'V3', 'V4', 'V5', 'V6'])
          .apply(model)
          .explode()
          .reset_index(name='Predicted value')
    )

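
To make the explode() approach concrete, here is a minimal sketch on made-up data. The toy frame, the KEYS constant, and the trick of rebuilding a time column with cumcount() are my own assumptions for illustration, not part of the question. Rebuilding time this way only works because the predictions are made for X_new = 1..59 in that order. Note that explode() leaves an object-dtype column, hence the astype(float).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

KEYS = ['V1', 'V2', 'V3', 'V4', 'V5', 'V6']  # grouping columns from the question

def model(df):
    # Same logic as in the question: fit speed ~ time, predict for 1..59
    X = df['time'].to_numpy().reshape((-1, 1))
    Y = df['speed'].to_numpy()
    X_new = np.arange(1, 60, 1).reshape((-1, 1))
    return np.squeeze(LinearRegression().fit(X, Y).predict(X_new))

# Hypothetical toy data: two groups, each an exact line speed = a * time + b
rows = []
for v1, (a, b) in {'a': (2.0, 1.0), 'b': (-1.0, 5.0)}.items():
    for t in range(1, 6):
        rows.append({'V1': v1, 'V2': 0, 'V3': 0, 'V4': 0, 'V5': 0, 'V6': 0,
                     'time': t, 'speed': a * t + b})
df = pd.DataFrame(rows)

out = (
    df.groupby(KEYS)
      .apply(model)                 # Series of arrays, indexed by the group keys
      .explode()                    # one row per predicted value (object dtype)
      .astype(float)                # convert back to a numeric dtype
      .reset_index(name='Predicted value')
)

# Predictions are ordered like X_new (1..59), so a per-group counter
# recovers the time each prediction belongs to.
out['time'] = out.groupby(KEYS).cumcount() + 1
print(out.head())
```

If the prediction grid ever changes from 1..59, the cumcount() line has to change with it, which is one reason the DataFrame-returning variant can be easier to maintain.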
If X_new must also be returned, it's more readable to construct the DataFrames in model() itself. group_predictions() must then also be modified to account for the fact that model() returns a DataFrame, not an array.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def model(df):
    X = df['time'].to_numpy().reshape((-1, 1))
    Y = df['speed'].to_numpy()
    X_new = np.arange(1, 60, 1).reshape((-1, 1))
    return pd.DataFrame({'X_new': X_new.ravel(), 'Predicted value': LinearRegression().fit(X, Y).predict(X_new)})

def group_predictions(df):
    # apply() adds the row index of each returned frame as an extra level; drop it
    return df.groupby(['V1', 'V2', 'V3', 'V4', 'V5', 'V6']).apply(model).droplevel(-1).reset_index()
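
For completeness, a small end-to-end sketch of this second variant on made-up data. The toy frame and the KEYS constant are hypothetical, and I have named the prediction grid column time instead of X_new so that the result has exactly the columns asked for in the question. Treat it as one possible arrangement rather than the only way to do it.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

KEYS = ['V1', 'V2', 'V3', 'V4', 'V5', 'V6']  # grouping columns from the question

def model(df):
    X = df['time'].to_numpy().reshape((-1, 1))
    Y = df['speed'].to_numpy()
    X_new = np.arange(1, 60, 1).reshape((-1, 1))
    pred = LinearRegression().fit(X, Y).predict(X_new)
    # Assumption: call the grid column 'time' to match the requested output
    return pd.DataFrame({'time': X_new.ravel(), 'Predicted value': pred})

def group_predictions(df):
    # The last index level is the 0..58 row index of each returned frame
    return df.groupby(KEYS).apply(model).droplevel(-1).reset_index()

# Hypothetical toy data: two groups, each an exact line speed = a * time + b
rows = []
for v1, (a, b) in {'a': (2.0, 1.0), 'b': (-1.0, 5.0)}.items():
    for t in range(1, 6):
        rows.append({'V1': v1, 'V2': 0, 'V3': 0, 'V4': 0, 'V5': 0, 'V6': 0,
                     'time': t, 'speed': a * t + b})
df = pd.DataFrame(rows)

result = group_predictions(df)
print(result.columns.tolist())
print(result.tail())
```

The result is an ordinary flat DataFrame with one row per group and time step, so it can be filtered, merged, or written to disk like any other frame.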