I'm having trouble understanding what a 1D global average pooling does to an embedding layer. I know that embedding layers are like lookup tables. If I have tf.keras.layers.Embedding(input_dim=30, output_dim=7, input_length=10) (vocabulary size 30, embedding dimension 7), is the output after feed-forwarding a matrix of 10 rows x 7 columns, or a 3D tensor of 1 row x 7 columns x 10 length?
If it's 10 rows x 7 columns, does it take the average of each row and output a single vector of shape 10 rows x 1 column?
If it's 1 row x 7 columns x 10 length, does it take the average of each vector and output a single vector, also of shape 10 rows x 1 column?
CodePudding user response:
To your first question: What's the output of an Embedding layer in TensorFlow?
The Embedding layer maps each integer value in a sequence (each representing a unique word in the vocabulary) to a 7-dimensional vector. In the following example, you have two sequences with 10 integer values each. These integer values can range from 0 to 29, where 30 is the size of the vocabulary. Each integer value of each sequence is mapped to a 7-dimensional vector, resulting in the output shape (2, 10, 7), where 2 is the number of samples, 10 is the sequence length, and 7 is the dimension of the vector each integer is mapped to:
import tensorflow as tf

samples = 2
# two sequences of 10 random integer word ids in [0, 30)
texts = tf.random.uniform((samples, 10), maxval=30, dtype=tf.int32)
# vocabulary size 30, embedding dimension 7
embedding_layer = tf.keras.layers.Embedding(30, 7, input_length=10)
print(embedding_layer(texts))  # shape (2, 10, 7)
tf.Tensor(
[[[ 0.0225671 0.02347589 0.00979777 0.00041901 -0.00628462
0.02810872 -0.00962182]
[-0.00848696 -0.04342243 -0.02836052 -0.00517335 -0.0061365
-0.03012114 0.01677728]
[ 0.03311044 0.00556745 -0.00702027 0.03381392 -0.04623893
0.04987461 -0.04816799]
[-0.03521906 0.0379228 0.03005264 -0.0020758 -0.0384485
0.04822161 -0.02092661]
[-0.03521906 0.0379228 0.03005264 -0.0020758 -0.0384485
0.04822161 -0.02092661]
[-0.01790254 -0.0175228 -0.01194855 -0.02171307 -0.0059397
0.02812174 0.01709754]
[ 0.03117083 0.03501941 0.01058724 0.0452967 -0.03717183
-0.04691924 0.04459465]
[-0.0225444 0.01631368 -0.04825303 0.02976335 0.03874404
0.01886607 -0.04535152]
[-0.01405543 -0.01035894 -0.01828993 0.01214089 -0.0163126
0.00249451 -0.03320551]
[-0.00536104 0.04976835 0.03676006 -0.04985759 -0.04882429
0.04079831 -0.04694915]]
[[ 0.02474061 0.04651412 0.01263839 0.02834389 0.01770737
0.027616 0.0391163 ]
[-0.00848696 -0.04342243 -0.02836052 -0.00517335 -0.0061365
-0.03012114 0.01677728]
[-0.02423838 0.00046005 0.01264722 -0.00118362 -0.04956226
-0.00222496 0.00678415]
[ 0.02132202 0.02490019 0.015528 0.01769954 0.03830704
-0.03469712 -0.00817447]
[-0.03713315 -0.01064591 0.0106518 -0.00899752 -0.04772154
0.03767705 -0.02580358]
[ 0.02132202 0.02490019 0.015528 0.01769954 0.03830704
-0.03469712 -0.00817447]
[ 0.00416059 -0.03158562 0.00862025 -0.03387908 0.02394537
-0.00088609 0.01963869]
[-0.0454465 0.03087567 -0.01201812 -0.02580545 0.02585572
-0.00974055 -0.02253721]
[-0.00438716 0.03688161 0.04575384 -0.01561296 -0.0137012
-0.00927494 -0.02183568]
[ 0.0225671 0.02347589 0.00979777 0.00041901 -0.00628462
0.02810872 -0.00962182]]], shape=(2, 10, 7), dtype=float32)
When working with text data, the output of this Embedding layer would correspond to 2 sentences of 10 words each, where each word is mapped to a 7-dimensional vector.
If you are wondering where the random numbers for each integer in each sequence come from: by default, the Embedding layer initializes its lookup table from a uniform distribution, and these values are then learned during training.
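If you want to look at (or control) that lookup table yourself, here is a minimal sketch: passing an explicit embeddings_initializer (the RandomUniform values below just reproduce the Keras default 'uniform') and inspecting the weight matrix with get_weights().
import tensorflow as tf

# the explicit initializer reproduces the Keras default ('uniform')
embedding_layer = tf.keras.layers.Embedding(
    30, 7, input_length=10,
    embeddings_initializer=tf.keras.initializers.RandomUniform(-0.05, 0.05))
_ = embedding_layer(tf.zeros((1, 10), dtype=tf.int32))  # call once so the weights are created
embedding_matrix = embedding_layer.get_weights()[0]
print(embedding_matrix.shape)  # (30, 7): one 7-dimensional vector per vocabulary entry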
To your second question: What does a 1D global average pooling do to an Embedding layer?
The GlobalAveragePooling1D layer does nothing more than calculate the average over one dimension of a tensor. Because the following example passes data_format="channels_first" to the pooling layer, it averages the 7 numbers representing each word and returns a scalar per word, resulting in the output shape (2, 10), where 2 is the number of samples (sentences) and 10 is the number of per-word averages. This is equivalent to simply doing tf.reduce_mean(embedding_layer(texts), axis=-1).
import tensorflow as tf

samples = 2
texts = tf.random.uniform((samples, 10), maxval=30, dtype=tf.int32)
embedding_layer = tf.keras.layers.Embedding(30, 7, input_length=10)
# "channels_first" averages over the last axis (the 7 embedding values per word)
average_layer = tf.keras.layers.GlobalAveragePooling1D(data_format="channels_first")
print(average_layer(embedding_layer(texts)))  # shape (2, 10)
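To verify the tf.reduce_mean equivalence mentioned above, a quick check reusing texts, embedding_layer and average_layer from the snippet:
pooled = average_layer(embedding_layer(texts))
manual = tf.reduce_mean(embedding_layer(texts), axis=-1)
print(pooled.shape, manual.shape)                     # (2, 10) (2, 10)
print(tf.reduce_all(tf.abs(pooled - manual) < 1e-6))  # tf.Tensor(True, shape=(), dtype=bool)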
CodePudding user response:
GlobalAveragePooling1D reduces the dimensionality of its input by averaging along one dimension.
As described in the Keras documentation, this layer has a data_format argument. By default it is "channels_last", meaning the input is interpreted as (batch, steps, features): the last (features) axis is kept and the average is taken along the steps axis.
Here is an example model:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, Embedding, GlobalAveragePooling1D

model = Sequential([
    Input((10,)),
    Embedding(30, 7, input_length=10),
    GlobalAveragePooling1D()  # default: data_format="channels_last"
])
model.summary()
output:
Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, 10, 7) 210
global_average_pooling1d (G (None, 7) 0
lobalAveragePooling1D)
=================================================================
Total params: 210
Trainable params: 210
Non-trainable params: 0
_________________________________________________________________
As you can see, the shape for a single sample was reduced from (10, 7) to (7,): the layer returns the average of the 10 word embeddings, i.e. one 7-dimensional vector per sentence.
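The same reduction can be done by hand; a minimal sketch with standalone layers (random integer input, not the model above) showing that the default pooling is just a mean over the word axis:
import tensorflow as tf

texts = tf.random.uniform((2, 10), maxval=30, dtype=tf.int32)
embedding = tf.keras.layers.Embedding(30, 7, input_length=10)
pooling = tf.keras.layers.GlobalAveragePooling1D()     # default: channels_last
print(pooling(embedding(texts)).shape)                 # (2, 7)
print(tf.reduce_mean(embedding(texts), axis=1).shape)  # (2, 7): same values, averaged over the 10 words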
If you set data_format="channels_first", as here:
model = Sequential([
    Input((10,)),
    Embedding(30, 7, input_length=10),
    GlobalAveragePooling1D(data_format="channels_first")
])
model.summary()
output:
Model: "sequential_2"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, 10, 7) 210
global_average_pooling1d (G (None, 10) 0
lobalAveragePooling1D)
=================================================================
Total params: 210
Trainable params: 210
Non-trainable params: 0
_________________________________________________________________
Here the shape for a single sample was reduced from (10, 7) to (10,): the layer returns the average of the values within each word's embedding, i.e. one scalar per word. That arguably doesn't make much sense, since you could set the embedding dimension to 1 and get an output of the same shape, as sketched below.
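To illustrate that shape argument (only a sketch; with an embedding dimension of 1 the per-word values are learned directly instead of being averaged):
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, Embedding, Reshape

tiny = Sequential([
    Input((10,)),
    Embedding(30, 1, input_length=10),  # one scalar per word
    Reshape((10,))                      # (None, 10, 1) -> (None, 10)
])
tiny.summary()  # output shape (None, 10), just like the channels_first pooling above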