For example:
from tensorflow.keras.preprocessing.text import one_hot
vocab_size = 5
one_hot('good job', vocab_size)
Out[6]: [3, 2]
For each word it returns only a single integer, 3 and 2, rather than a vector of size 5 made of 1s and 0s. Shouldn't one-hot encoding always yield a vector of 1s and 0s?
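To illustrate what the question expects: the integers that one_hot returns are just vocabulary indices, and a true one-hot vector is that index expanded into a row of 0s with a single 1. A minimal plain-Python sketch (no TensorFlow needed; to_one_hot is a hypothetical helper, not a Keras function) that expands such indices:

```python
def to_one_hot(indices, vocab_size):
    """Expand integer indices into one-hot rows of length vocab_size."""
    return [[1 if i == idx else 0 for i in range(vocab_size)]
            for idx in indices]

# The indices returned by one_hot('good job', 5) in the question:
print(to_one_hot([3, 2], 5))
# [[0, 0, 0, 1, 0], [0, 0, 1, 0, 0]]
```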
CodePudding user response:
This is simply how this function works: it yields integer indices rather than one-hot vectors. Note that tensorflow.keras.preprocessing.text.one_hot is also being deprecated, likely because of this unintuitive behavior.
Deprecated: tf.keras.text.preprocessing.one_hot does not operate on tensors and is not recommended for new code. Prefer tf.keras.layers.Hashing with output_mode='one_hot' which provides equivalent functionality through a layer which accepts tf.Tensor input. See the preprocessing layer guide for an overview of preprocessing layers.
The recommendation is to use:
tf.keras.layers.Hashing(
num_bins,
mask_value=None,
salt=None,
output_mode='int',
sparse=False,
**kwargs
)
If you change output_mode from 'int'
to 'one_hot'
you will get one one-hot vector per token, which is what the question is looking for ('multi_hot' instead collapses each sample into a single vector).
From the documentation:
Specification for the output of the layer. Defaults to "int". Values can be "int", "one_hot", "multi_hot", or "count", configuring the layer as follows:
"int": Return the integer bin indices directly.
"one_hot": Encodes each individual element in the input into an array the same size as num_bins, containing a 1 at the input's bin index. If the last dimension is size 1, will encode on that dimension. If the last dimension is not size 1, will append a new dimension for the encoded output.
"multi_hot": Encodes each sample in the input into a single array the same size as num_bins, containing a 1 for each bin index present in the sample. Treats the last dimension as the sample dimension; if input shape is (..., sample_length), output shape will be (..., num_tokens).
"count": As "multi_hot", but the int array contains a count of the number of times the bin index appeared in the sample.