I'm playing around a bit with TensorFlow 2.7.0 and its new TextVectorization layer. However, something does not work quite right in this simple example:
import tensorflow as tf
import numpy as np
X = np.array(['this is a test', 'a nice test', 'best test this is'])
vectorize_layer = tf.keras.layers.TextVectorization()
vectorize_layer.adapt(X)
emb_layer = tf.keras.layers.Embedding(input_dim=vectorize_layer.vocabulary_size() + 1, output_dim=2, input_length=4)
flatten_layer = tf.keras.layers.Flatten()
dense_layer = tf.keras.layers.Dense(1)
model = tf.keras.models.Sequential()
model.add(vectorize_layer)
model.add(emb_layer)
model.add(flatten_layer)
#model.add(dense_layer)
model(X)
This works so far: I turn words into ints, embed them, and flatten them. But if I want to add a Dense layer after flattening (i.e. uncomment the line above), things break, and I get the error message from the question title. I even used the input_length parameter of the Embedding layer, because the documentation says I should specify it when chaining Embedding -> Flatten -> Dense. But it just does not work.
Do you know how I can get it to work using Flatten, and not something like GlobalAveragePooling1D?
Thanks a lot!
CodePudding user response:
You need to define a max length for the sequences.
vectorize_layer = tf.keras.layers.TextVectorization(output_mode = 'int',
output_sequence_length=10)
If you check model.summary(), the output shape of the TextVectorization layer will be (None, None). The first None indicates that the model can accept any batch size, and the second indicates that sentences passed to TextVectorization are neither truncated nor padded, so the output sequences can have variable length.
Example:
import tensorflow as tf
import numpy as np
X = np.array(['this is a test', 'a nice test', 'best test this is'])
vectorize_layer = tf.keras.layers.TextVectorization(output_mode = 'int')
vectorize_layer.adapt(X)
model = tf.keras.models.Sequential()
model.add(vectorize_layer)
model(np.array(['this is a test']))
>> <tf.Tensor: shape=(1, 4), dtype=int64, numpy=array([[3, 4, 5, 2]])>
model(np.array(['this is a longer test sentence']))
>> <tf.Tensor: shape=(1, 6), dtype=int64, numpy=array([[3, 4, 5, 1, 2, 1]])>
Redefining it with a fixed output length (and re-adapting and rebuilding the model, since the model above still holds the old layer):
vectorize_layer = tf.keras.layers.TextVectorization(output_mode = 'int',
                                                    output_sequence_length = 5)
vectorize_layer.adapt(X)
model = tf.keras.models.Sequential()
model.add(vectorize_layer)
model(np.array(['this is a longer test sentence']))
>> <tf.Tensor: shape=(1, 5), dtype=int64, numpy=array([[3, 4, 5, 1, 2]])>
model(np.array(['this is']))
>> <tf.Tensor: shape=(1, 5), dtype=int64, numpy=array([[3, 4, 0, 0, 0]])>
Setting output_sequence_length to a number guarantees that every output sequence has exactly that length: longer sentences are truncated and shorter ones are zero-padded. With a static sequence length, Flatten produces a known output size, so the following Dense layer can build its weights.
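Putting it together, here is a minimal end-to-end sketch of the full pipeline from the question with the fix applied. The layer sizes (output_dim=2, Dense(1)) are just the illustrative values from the question, and output_sequence_length=4 is an arbitrary choice:

```python
import numpy as np
import tensorflow as tf

X = np.array(['this is a test', 'a nice test', 'best test this is'])

# Fixing the sequence length makes the Flatten output shape static,
# which is what the Dense layer needs to build its weights.
vectorize_layer = tf.keras.layers.TextVectorization(
    output_mode='int', output_sequence_length=4)
vectorize_layer.adapt(X)

model = tf.keras.models.Sequential([
    vectorize_layer,
    tf.keras.layers.Embedding(
        input_dim=vectorize_layer.vocabulary_size() + 1,
        output_dim=2),
    tf.keras.layers.Flatten(),  # (batch, 4, 2) -> (batch, 8)
    tf.keras.layers.Dense(1),
])

out = model(X)
print(out.shape)  # (3, 1): one scalar output per input sentence
```

The same model fails without output_sequence_length, because Flatten cannot infer a fixed size from a (None, None, 2) input.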