I am trying to build a machine learning model which predicts a single number from a series of numbers. I am using a Sequential model from the keras API of Tensorflow.
You can imagine my dataset to look something like this:
Index | x data | y data |
---|---|---|
0 | np.ndarray(shape (1209278,) ) | numpy.float32 |
1 | np.ndarray(shape (1211140,) ) | numpy.float32 |
2 | np.ndarray(shape (1418411,) ) | numpy.float32 |
3 | np.ndarray(shape (1077132,) ) | numpy.float32 |
... | ... | ... |
This was my first attempt:
I tried using a numpy ndarray which contains numpy ndarrays which finally contain floats as my xdata, so something like this:
array([
array([3.59280851, 3.60459062, 3.60459062, ..., 4.02911493])
array([3.54752101, 3.56740332, 3.56740332, ..., 4.02837855])
array([3.61048168, 3.62152741, 3.62152741, ..., 4.02764217])
])
My y data is a numpy ndarray containing floats, which looks something like this
array([2.9864411, 3.0562437, ... , 2.7750807, 2.8712902], dtype=float32)
But when I tried to train the model using model.fit(), it yielded this error:
ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type numpy.ndarray).
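A small sketch of what is probably going on, using made-up short arrays in place of the real data (the names `seqs` and `x` are only for illustration). Because the inner arrays differ in length, NumPy cannot build a regular 2-D array and has to store them in a 1-D array of dtype object, which is the kind of array the tensor conversion rejects:

```python
import numpy as np

# Made-up stand-ins for the real sequences; note the different lengths.
seqs = [
    np.array([3.59, 3.60, 3.60, 4.03]),
    np.array([3.55, 3.57, 4.03]),
    np.array([3.61, 3.62, 3.62, 3.70, 4.03]),
]

# Build the "ndarray of ndarrays" explicitly as an object array.
x = np.empty(len(seqs), dtype=object)
for i, s in enumerate(seqs):
    x[i] = s

print(x.dtype)  # object
print(x.shape)  # (3,)
```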
I was able to solve this error by asking a question related to this: How can I have a series of numpy ndarrays as the input data to train a tensorflow machine learning model?
My latest attempt: Because Tensorflow does not seem to be able to convert a ndarray of ndarrays to a tensor, I tried to convert my x data to a list of ndarrays like this:
[
array([3.59280851, 3.60459062, 3.60459062, ..., 4.02911493])
array([3.54752101, 3.56740332, 3.56740332, ..., 4.02837855])
array([3.61048168, 3.62152741, 3.62152741, ..., 4.02764217])
]
I left my y data untouched, i.e. as an ndarray of floats. Sadly, using a list of ndarrays instead of an ndarray of ndarrays yielded this error:
ValueError: Data cardinality is ambiguous:
x sizes: 1304593, 1209278, 1407624, ...
y sizes: 46
Make sure all arrays contain the same number of samples.
As you can see, my x data consists of arrays which all have a different shape. But I don't think that this should be a problem.
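The sizes in the message can be reproduced with a small sketch. The arrays below are made up and only mimic the shapes involved, not what Keras does internally: if every element of the list is treated as its own input array, its size is its own length, while the size of y is the number of samples.

```python
import numpy as np

# Made-up stand-ins: three sequences of different length, three targets.
x_list = [np.zeros(5), np.zeros(7), np.zeros(6)]
y = np.zeros(3, dtype=np.float32)

# Treating each list element as a separate input, its "size" is its length...
x_sizes = [len(a) for a in x_list]
# ...while the size of y is the number of samples.
y_size = len(y)

print("x sizes:", x_sizes)  # x sizes: [5, 7, 6]
print("y sizes:", y_size)   # y sizes: 3
```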
Question:
My guess is that Tensorflow tries to use my list of arrays as multiple inputs (see the Tensorflow fit() documentation).
But I don't want to use my x data as multiple inputs. Simply put, I just want my model to predict a number from a sequence of numbers, for example like this:
- array([3.59280851, 3.60459062, 3.60459062, ...]) => 2.8989773
- array([3.54752101, 3.56740332, 3.56740332, ...]) => 3.0893357
- ...
How can I use a sequence of numbers to predict a single number in Tensorflow?
EDIT
Maybe I should have added that I want to use an RNN, specifically an LSTM.
I have had a look at the Keras documentation, and in their simplest example they are using an Embedding layer. But I don't really know what to do.
All in all, I think that my question is pretty general and should be easy to answer if you know how to tackle this problem, unlike me. Thanks in advance!
CodePudding user response:
Try something like this:
import numpy as np
import tensorflow as tf
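# train_set is assumed to be a pandas DataFrame with the columns shown in the question
# stack the sequences (this only works if they all have the same length)
# and add an additional feature dimension for the lstm layer
x_train = np.stack(train_set["x data"].values)[..., None]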
y_train = np.asarray(train_set["y data"]).astype(np.float32)
model = tf.keras.Sequential()
model.add(tf.keras.layers.LSTM(units=32))
model.add(tf.keras.layers.Dense(units=1))
model.compile(loss="mean_squared_error", optimizer="adam", metrics=["mse"])
model.fit(x=x_train, y=y_train, epochs=10)
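The additional dimension matters because an LSTM layer expects input of shape (batch, timesteps, features). A minimal sketch of that reshaping with synthetic data, NumPy only; the sizes are arbitrary, the sequences are assumed to have equal length, and `np.stack` is my choice for building the 2-D array:

```python
import numpy as np

rng = np.random.default_rng(0)

# Four synthetic sequences, each with 10 time steps (equal length assumed).
sequences = [rng.random(10) for _ in range(4)]

x = np.stack(sequences)  # shape (4, 10): (batch, timesteps)
x = x[..., None]         # shape (4, 10, 1): (batch, timesteps, features)
y = rng.random(4).astype(np.float32)  # one target per sequence

print(x.shape, y.shape)  # (4, 10, 1) (4,)
```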
Or with a ragged input for different sequence lengths:
x_train = tf.ragged.constant(train_set["x data"].values[..., None]) # add additional dimension for lstm layer
y_train = np.asarray(train_set["y data"]).astype(np.float32)
model = tf.keras.Sequential()
model.add(tf.keras.layers.Input(shape=[None, x_train.bounding_shape()[-1]], batch_size=2, dtype=tf.float32, ragged=True))
model.add(tf.keras.layers.LSTM(units=32))
model.add(tf.keras.layers.Dense(units=1))
model.compile(loss="mean_squared_error", optimizer="adam", metrics=["mse"])
model.fit(x=x_train, y=y_train, epochs=10)
Or:
x_train = tf.ragged.constant([np.array(list(v))[..., None] for v in train_set["x data"].values]) # add additional dimension for lstm layer
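If ragged tensors are not an option, a common alternative (not part of the answer above) is to zero-pad the sequences to a common length and let the model ignore the padding, for example with a `tf.keras.layers.Masking` layer in front of the LSTM. Below is a NumPy-only sketch of the padding step; `pad_sequences_to_longest` is a hypothetical helper name, and padding with 0.0 assumes that 0.0 never occurs as a real value in the data:

```python
import numpy as np

def pad_sequences_to_longest(sequences, pad_value=0.0):
    """Hypothetical helper: right-pad 1-D sequences to the longest length.

    Returns a float32 array of shape (n_sequences, max_len, 1).
    """
    max_len = max(len(s) for s in sequences)
    padded = np.full((len(sequences), max_len), pad_value, dtype=np.float32)
    for i, s in enumerate(sequences):
        padded[i, :len(s)] = s  # copy the real values, leave the rest padded
    return padded[..., None]    # add the feature dimension

# Made-up short sequences of different length.
seqs = [np.array([3.59, 3.60, 4.03]), np.array([3.55, 4.03])]
x_padded = pad_sequences_to_longest(seqs)

print(x_padded.shape)  # (2, 3, 1)
```

With sequences as long as the ones in the question, padding everything to the longest length costs extra memory, so this is only meant to show the idea.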