Multivariate Keras Prediction Model With LSTM: Which index is used when predicting?

Apologies as I am new to using Keras and working with LSTM predictions in general. I am writing code that takes in a CSV file whose columns are float or int values which are related in some way, uses a Keras LSTM model to train against these columns, and attempts to predict one of the columns as an output. I am following this guide:

https://machinelearningmastery.com/multivariate-time-series-forecasting-lstms-keras/

In the example, the predicted column is the amount of air pollution, and all other values are used to predict it. Adapting the example code for my CSV seems straightforward: I change the size of the training data and the number of columns appropriately.

My issue is that I don't understand why the example code is outputting predicted values for the "pollution" column and not some other column. I could just make the column I want to predict the second column in my formatted input CSV, but I would like to understand what is actually happening in the example as much as possible. Looking at the documentation for Model.predict(), it says the input value can be a list of arrays if the model has multiple inputs, but the return value is only described as "numpy array(s) of predictions", and it doesn't specify how I can make it return "arrays" versus an array. When I print out the result of this function, I only get an array of predictions for the pollution variable, so it seems like that column is selected somewhere before this point.

How can I change which column is returned by predict()?

CodePudding user response：

Which column predict() returns depends on what you select as your output data (y). When the author preprocessed their data, they made the current pollution the last column of their dataset. Then, when selecting an output, y, to train and evaluate the model, they ran these lines of code:

# split into input and outputs
train_X, train_y = train[:, :-1], train[:, -1]
test_X, test_y = test[:, :-1], test[:, -1]

The input (X) arrays include all rows and every column except the last (the :-1 slice), whereas the output (y) arrays include all rows but only the last column (index -1), which is the pollution variable.
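As a minimal sketch of the same split with a different target, suppose the column you want to predict is not the last one. The array "data" and the index "target_col" below are made up for illustration and are not from the tutorial:

```python
import numpy as np

# Hypothetical stand-in for the tutorial's `train` array: 5 rows, 4 columns.
data = np.arange(20, dtype=float).reshape(5, 4)

# Hypothetical index of the column you want the model to predict.
target_col = 2

# y is the chosen column; X is every remaining column.
y = data[:, target_col]
X = np.delete(data, target_col, axis=1)

print(X.shape)  # (5, 3)
print(y.shape)  # (5,)
```

Moving the target to the last column during preprocessing, as the tutorial does, and slicing with -1 gives the same result; np.delete just avoids having to reorder the columns first.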

While training, the model tries to use the inputs (here, the values from the previous timestep) to predict the output (here, the pollution at the current time) as accurately as possible. So when the model makes predictions, it applies the mapping it learned from X to y, and that mapping outputs pollution.
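To see that the target alone decides what comes out of the prediction step, here is a small sketch that uses plain least squares as a stand-in for the LSTM (it is not the tutorial's model, and the data is synthetic). The same inputs are fit against two hypothetical targets, and the prediction follows whichever y was used for fitting:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

# Two hypothetical candidate targets built from the same inputs.
y_a = X @ np.array([1.0, 0.0, 0.0])  # depends only on feature 0
y_b = X @ np.array([0.0, 2.0, 1.0])  # depends on features 1 and 2


def fit_and_predict(X_train, y_train, X_new):
    # Least squares stands in for model.fit(...) followed by model.predict(...).
    weights, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    return X_new @ weights


X_new = np.array([[1.0, 1.0, 1.0]])
pred_a = fit_and_predict(X, y_a, X_new)  # close to 1.0
pred_b = fit_and_predict(X, y_b, X_new)  # close to 3.0
print(pred_a, pred_b)
```

Nothing in the prediction call names a column. The only difference between the two results is which y was passed in when fitting, which is the same reason the tutorial's model returns pollution.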

So, in summation, select the column that you want your model to predict as the train_y and test_y datasets! Hope this helps!