TensorFlow TextVectorization adapt(): checking the produced vocabulary

A TextVectorization layer is used to encode words as integers, and the typical workflow calls its adapt() method:

Next, you will call adapt to fit the state of the preprocessing layer to the dataset. This will cause the model to build an index of strings to integers.

(https://www.tensorflow.org/tutorials/keras/text_classification)

If desired, the user can call this layer's adapt() method on a dataset. When this layer is adapted, it will analyze the dataset, determine the frequency of individual string values, and create a 'vocabulary' from them.

(https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization#adapt)

What precisely is the result of the adapt() operation, and how can I concretely check the content of the created vocabulary?

A small piece of my code:

from tensorflow.keras.layers import TextVectorization

seq_length = 100
vocab_size = 50000

vectorize_layer = TextVectorization(
    max_tokens=vocab_size,
    output_mode='int',
    output_sequence_length=seq_length)

# build the vocabulary
vectorize_layer.adapt(text_ds)

CodePudding user response:

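After adapt(), call get_vocabulary() on the layer. It returns the learned vocabulary as a list of strings, where a token's position in the list is the integer it is encoded as. TextVectorization has this method too; the StringLookup example from the docs shows the idea: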

>>> import tensorflow as tf
>>> data = tf.constant([["a", "c", "d"], ["d", "z", "b"]])
>>> layer = tf.keras.layers.StringLookup()
>>> layer.adapt(data)
>>> layer.get_vocabulary()

['[UNK]', 'd', 'z', 'c', 'b', 'a']

https://www.tensorflow.org/api_docs/python/tf/keras/layers/StringLookup
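
The same check applied to a TextVectorization layer might look like the sketch below. The three-sentence corpus is a made-up stand-in for the question's text_ds, and output_sequence_length is shortened to 5 so the result is easy to read; everything else follows the question's setup. With output_mode='int', index 0 is reserved for padding (the empty string) and index 1 for out-of-vocabulary words ('[UNK]'), followed by the tokens from most to least frequent.

```python
import tensorflow as tf

# Hypothetical toy corpus standing in for the question's text_ds.
text_ds = tf.data.Dataset.from_tensor_slices(
    ["the cat sat", "the dog sat", "the bird flew"]
).batch(2)

vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=50000,
    output_mode="int",
    output_sequence_length=5,  # shortened from 100 for readability
)

# Build the vocabulary from the corpus.
vectorize_layer.adapt(text_ds)

# The vocabulary is a plain list of strings; list position == token id.
vocab = vectorize_layer.get_vocabulary()
print(len(vocab))   # number of entries, including the two reserved ones
print(vocab[:4])    # padding, OOV, then the most frequent tokens

# Encode a sentence and map the ids back through the vocabulary.
ids = vectorize_layer(tf.constant(["the cat barked"])).numpy()[0]
decoded = [vocab[i] for i in ids]
print(ids)
print(decoded)      # "barked" was never seen, so it decodes to '[UNK]'
```

Tokens with equal counts (here cat, dog, bird, flew) can come out in any order, so don't rely on their relative positions. The default standardization also lowercases the text and strips punctuation before the vocabulary is built, which is why the entries may not match the raw strings exactly.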