Why doesn't Keras one_hot return vectors of ones and zeros?

Time:01-10

For example:

from tensorflow.keras.preprocessing.text import one_hot
vocab_size = 5
one_hot('good job', vocab_size)
Out[6]: [3, 2]

Each word is assigned a single integer (3 and 2 here), not a vector of size 5 made of 1s and 0s. Shouldn't one-hot encoding always yield a vector of 1s and 0s?

CodePudding user response:

This is how the function works: it returns one integer index per word rather than one-hot encoded (OHE) vectors. That unintuitive behavior is probably part of the reason tensorflow.keras.preprocessing.text.one_hot is being deprecated.
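To see how the integers relate to the vectors you expected, here is a minimal pure-Python sketch. The helper name indices_to_one_hot is hypothetical and not part of Keras; it only assumes that each integer marks the position of the 1 in a vector of length vocab_size.

```python
def indices_to_one_hot(indices, vocab_size):
    """Expand integer word indices into one-hot vectors of length vocab_size."""
    vectors = []
    for index in indices:
        vector = [0] * vocab_size  # start with all zeros
        vector[index] = 1          # set a single 1 at the word's index
        vectors.append(vector)
    return vectors


encoded = [3, 2]  # the integers shown in the question for 'good job'
print(indices_to_one_hot(encoded, 5))
# [[0, 0, 0, 1, 0], [0, 0, 1, 0, 0]]
```

So the function's output carries the same information as the one-hot vectors, just in a compact form.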

Deprecated: tf.keras.text.preprocessing.one_hot does not operate on tensors and is not recommended for new code. Prefer tf.keras.layers.Hashing with output_mode='one_hot' which provides equivalent functionality through a layer which accepts tf.Tensor input. See the preprocessing layer guide for an overview of preprocessing layers.

The recommendation is to use:

tf.keras.layers.Hashing(
    num_bins,
    mask_value=None,
    salt=None,
    output_mode='int',
    sparse=False,
    **kwargs
)

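If you change output_mode from 'int' to 'one_hot', you will get the one-hot vectors you are looking for, one per word, as the deprecation notice suggests. With 'multi_hot' you would instead get a single vector per sample, with a 1 for every bin that occurs in it. Note that the layer hashes each input element as a whole, so the text has to be split into words first.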

From the documentation:

Specification for the output of the layer. Defaults to "int". Values can be "int", "one_hot", "multi_hot", or "count" configuring the layer as follows:

"int": Return the integer bin indices directly.
"one_hot": Encodes each individual element in the input into an array the same size as num_bins, containing a 1 at the input's bin index. If the last dimension is size 1, will encode on that dimension. If the last dimension is not size 1, will append a new dimension for the encoded output.
"multi_hot": Encodes each sample in the input into a single array the same size as num_bins, containing a 1 for each bin index present in the sample. Treats the last dimension as the sample dimension, if input shape is (..., sample_length), output shape will be (..., num_tokens).
"count": As "multi_hot", but the int array contains a count of the number of times the bin index appeared in the sample.
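The difference between the four modes is easier to see on a concrete sample. The following is a pure-Python illustration of the semantics described above, not the layer's actual implementation; the function name encode_sample and the sample values are made up for the example, and the sample is assumed to be a list of bin indices that have already been computed.

```python
def encode_sample(bin_indices, num_bins, output_mode="int"):
    """Mimic the documented output modes for one sample of integer bin indices."""
    if output_mode == "int":
        # Return the bin indices unchanged.
        return list(bin_indices)
    if output_mode == "one_hot":
        # One vector per element, with a single 1 at that element's bin.
        return [[1 if i == index else 0 for i in range(num_bins)]
                for index in bin_indices]
    if output_mode == "multi_hot":
        # One vector for the whole sample, with a 1 for every bin present.
        present = set(bin_indices)
        return [1 if i in present else 0 for i in range(num_bins)]
    if output_mode == "count":
        # One vector for the whole sample, counting occurrences per bin.
        return [list(bin_indices).count(i) for i in range(num_bins)]
    raise ValueError(f"Unknown output_mode: {output_mode!r}")


sample = [3, 2, 3]  # hypothetical bin indices for a three-word sample
for mode in ("int", "one_hot", "multi_hot", "count"):
    print(mode, encode_sample(sample, 5, mode))
# int [3, 2, 3]
# one_hot [[0, 0, 0, 1, 0], [0, 0, 1, 0, 0], [0, 0, 0, 1, 0]]
# multi_hot [0, 0, 1, 1, 0]
# count [0, 0, 1, 2, 0]
```

Only "one_hot" keeps one vector per word; "multi_hot" and "count" collapse the whole sample into a single vector and lose the word order.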
