Why doesn't Keras one-hot encoding return vectors of 1s and 0s?

For example:

from tensorflow.keras.preprocessing.text import one_hot
vocab_size = 5
one_hot('good job', vocab_size)
Out[6]: [3, 2]

Each word is assigned a single integer (3 and 2 here) rather than a vector of size 5 made of 1s and 0s. Shouldn't one-hot encoding always yield a vector of 1s and 0s?

CodePudding user response:

This is how the function works. Despite its name, tensorflow.keras.preprocessing.text.one_hot returns one integer index per word instead of one-hot encoded (OHE) vectors. The function is also deprecated in favor of preprocessing layers.
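Each integer can be read as the position of the 1 in the corresponding one-hot vector, so no information is missing. A minimal plain-Python sketch, using the indices and vocab_size from the question (the helper name to_one_hot is hypothetical and not part of Keras):

```python
def to_one_hot(indices, size):
    """Expand integer indices into 0/1 vectors of length `size`."""
    return [[1 if position == index else 0 for position in range(size)]
            for index in indices]


# [3, 2] is the output shown in the question, with vocab_size = 5.
vectors = to_one_hot([3, 2], 5)
print(vectors)  # [[0, 0, 0, 1, 0], [0, 0, 1, 0, 0]]
```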

Deprecated: tf.keras.text.preprocessing.one_hot does not operate on tensors and is not recommended for new code. Prefer tf.keras.layers.Hashing with output_mode='one_hot' which provides equivalent functionality through a layer which accepts tf.Tensor input. See the preprocessing layer guide for an overview of preprocessing layers.

The recommended replacement is:

tf.keras.layers.Hashing(
    num_bins,
    mask_value=None,
    salt=None,
    output_mode='int',
    sparse=False,
    **kwargs
)

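If you change output_mode from 'int' to 'one_hot', each word is encoded as its own vector of 1s and 0s, which is what you are looking for. With 'multi_hot' you would instead get a single vector per sample, with a 1 for every bin present in that sample.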
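As a rough sketch of that suggestion, assuming a TensorFlow version in which Hashing accepts output_mode, the words from the question can be passed one per row. The bin each word lands in depends on the layer's hash function, so the positions of the 1s need not match the integers returned by the legacy one_hot call.

```python
import tensorflow as tf

# num_bins plays the role of vocab_size from the question.
hashing = tf.keras.layers.Hashing(num_bins=5, output_mode="one_hot")

# One word per row: the last dimension has size 1, so it is encoded in place.
words = tf.constant([["good"], ["job"]])
encoded = hashing(words)

print(encoded.shape)    # (2, 5): one vector of length num_bins per word
print(encoded.numpy())  # each row holds a single 1 at that word's bin
```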

From the documentation:

Specification for the output of the layer. Defaults to "int". Values can be "int", "one_hot", "multi_hot", or "count" configuring the layer as follows:
"int": Return the integer bin indices directly.
"one_hot": Encodes each individual element in the input into an array the same size as num_bins, containing a 1 at the input's bin index. If the last dimension is size 1, will encode on that dimension. If the last dimension is not size 1, will append a new dimension for the encoded output.
"multi_hot": Encodes each sample in the input into a single array the same size as num_bins, containing a 1 for each bin index present in the sample. Treats the last dimension as the sample dimension, if input shape is (..., sample_length), output shape will be (..., num_tokens).
"count": As "multi_hot", but the int array contains a count of the number of times the bin index appeared in the sample.
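The difference between the two per-sample modes can be illustrated without TensorFlow. The sketch below assumes a hypothetical sample whose three words were hashed to bins 3, 2 and 3; the helper functions are illustrative stand-ins for the documented behavior, not the layer's implementation.

```python
from collections import Counter

NUM_BINS = 5
sample_bins = [3, 2, 3]  # hypothetical "int" output for one three-word sample


def multi_hot(bins, num_bins):
    """One vector per sample: 1 for every bin that occurs at least once."""
    present = set(bins)
    return [1 if b in present else 0 for b in range(num_bins)]


def count(bins, num_bins):
    """One vector per sample: how many times each bin occurs."""
    tally = Counter(bins)
    return [tally[b] for b in range(num_bins)]


print(multi_hot(sample_bins, NUM_BINS))  # [0, 0, 1, 1, 0]
print(count(sample_bins, NUM_BINS))      # [0, 0, 1, 2, 0]
```

Both modes collapse the sample into a single vector, so word order and per-word identity are lost. "one_hot" is the mode that keeps one vector per word.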