Add a NumPy array to a pandas DataFrame

I'm experimenting with time series predictions, something like this:

import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

model = SARIMAX(data.values, 
                order=order, 
                seasonal_order=seasonal_order)

result = model.fit()

train = data.sample(frac=0.8,random_state=0)
test = data.drop(train.index)
start = len(train)
end = len(train) + len(test) - 1
  
# Predictions for one-year against the test set
predictions = result.predict(start, end,
                             typ='levels')

where predictions is a NumPy array. How do I add this to my test pandas df? I tried this: test['predicted'] = predictions.tolist()

This won't concatenate properly. I was hoping to add the predictions as another column in my df, but instead the result looks like this:

hour
2021-06-07 17:00:00                                          75726.57143
2021-06-07 20:00:00                                          62670.06667
2021-06-08 00:00:00                                             16521.65
2021-06-08 14:00:00                                              71628.1
2021-06-08 17:00:00                                          62437.16667
                                             ...                        
2021-09-23 22:00:00                                          7108.533333
2021-09-24 02:00:00                                              13325.2
2021-09-24 04:00:00                                          13322.31667
2021-09-24 13:00:00                                             37941.65
predicted              [13605.31231433516, 12597.907337725523, 13484....  <--- not coming in as another df column

Would anyone have any advice? I'm hoping to ultimately plot the predicted values against the test values and also calculate the RMSE, maybe something like:

from sklearn.metrics import mean_squared_error
from statsmodels.tools.eval_measures import rmse

# Calculate root mean squared error
rmse(test, predictions)
  
# Calculate mean squared error
mean_squared_error(test, predictions)

EDIT

train = data.sample(frac=0.8,random_state=0)
test = data.drop(train.index)

start = len(train)
end = len(train) + len(test) - 1

CodePudding user response：

You should be able to add it as a column directly, without any additional conversion. The output from result.predict() should be a pandas Series. If it is not, you can still assign it directly to the DataFrame, as long as it has the same length and order as the DataFrame's rows.

import numpy as np
import pandas as pd

test = pd.DataFrame({'date': ['01-01-2020', '01-02-2020', '01-03-2020', '01-04-2020', '01-05-2020'],
                     'value': [15, 25, 35, 45, 55]})
test['date'] = pd.to_datetime(test['date'])
test = test.set_index('date')

predictions = np.array([10,20,30,40,50])

test['predictions'] = predictions

Output:

            value  predictions
date                          
2020-01-01     15           10
2020-01-02     25           20
2020-01-03     35           30
2020-01-04     45           40
2020-01-05     55           50
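
One more thing worth checking: the output in the question suggests that test may be a Series rather than a DataFrame, in which case test['predicted'] = ... adds a new index label instead of a new column. Below is a minimal sketch of one way around that, assuming test really is a Series. The sample values, the column name actual, and the names test_df and rmse_value are made up for illustration and are not from the original post. The RMSE is computed by hand with NumPy just to keep the sketch self-contained.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's `test`: a Series indexed by hour.
idx = pd.to_datetime(["2021-06-07 17:00", "2021-06-07 20:00", "2021-06-08 00:00"])
test = pd.Series([75726.57, 62670.07, 16521.65], index=idx, name="actual")
test.index.name = "hour"

# Hypothetical stand-in for the array returned by result.predict().
predictions = np.array([70000.0, 60000.0, 20000.0])

# Turn the Series into a one-column DataFrame, so a new key means a new column.
test_df = test.to_frame()

# A NumPy array is assigned by position, so the lengths have to match.
assert len(predictions) == len(test_df)
test_df["predicted"] = predictions

# Root mean squared error between the actual and predicted columns.
errors = test_df["actual"] - test_df["predicted"]
rmse_value = float(np.sqrt(np.mean(errors ** 2)))

print(test_df)
print(rmse_value)
```

With both columns in one DataFrame, test_df.plot() should draw the actual and predicted values against the hour index, provided matplotlib is installed.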