Why is the Keras Embedding layer's weight matrix of size vocab_size + 1?


I have the toy example below where my vocabulary size is 7 and the embedding size is 8, yet the weights of the Keras Embedding layer come out as an 8x8 matrix. How is that? This seems to be connected to other questions about the Keras Embedding layer's input_dim being "maximum integer index + 1". I've read the other Stack Overflow questions on this, but all of them suggest it should not be vocab_size + 1, while my code tells me it is. I'm asking because I need to know exactly which embedding vector relates to which word.

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense

docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work']
labels = np.array([1,1,1,1])
tokenizer = Tokenizer()
tokenizer.fit_on_texts(docs)
encoded_docs = tokenizer.texts_to_sequences(docs)
max_seq_len = max(len(x) for x in encoded_docs) # max len is 2
padded_seq = pad_sequences(sequences=encoded_docs, maxlen=max_seq_len, padding='post')
embedding_size = 8
tokenizer.index_word

{1: 'work', 2: 'well', 3: 'done', 4: 'good', 5: 'great', 6: 'effort', 7: 'nice'}

len(tokenizer.index_word) # 7
vocab_size = len(tokenizer.index_word) + 1
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_size, input_length=max_seq_len, name='embedding_lay'))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_lay (Embedding)    (None, 2, 8)              64        
_________________________________________________________________
flatten_1 (Flatten)          (None, 16)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
=================================================================
Total params: 81
Trainable params: 81
Non-trainable params: 0

model.fit(padded_seq,labels, verbose=1,epochs=20)
model.get_layer('embedding_lay').get_weights()

[array([[-0.0389936 , -0.0294274 ,  0.02361362,  0.01885288, -0.01246006,
         -0.01004354,  0.01321061, -0.02298149],
        [-0.01264734, -0.02058442,  0.0114141 , -0.02725944, -0.06267354,
          0.05148344, -0.02335678, -0.06039589],
        [ 0.0582506 ,  0.00020944, -0.04691287,  0.02985037,  0.02437406,
         -0.02782   ,  0.00378997,  0.01849808],
        [-0.01667434, -0.00078654, -0.04029636, -0.04981862,  0.01762467,
          0.06667487,  0.00302309,  0.02881355],
        [ 0.04509508, -0.01994639,  0.01837089, -0.00047283,  0.01141069,
         -0.06225454,  0.01198813,  0.02102971],
        [ 0.05014603,  0.04591557, -0.03119368,  0.04181939,  0.02837115,
         -0.01640332,  0.0577693 ,  0.01364574],
        [ 0.01948108, -0.04200416, -0.06589368, -0.05397511,  0.02729052,
          0.04164972, -0.03795817, -0.06763416],
        [ 0.01284658,  0.05563928, -0.026766  ,  0.03231764, -0.0441488 ,
         -0.02879154,  0.02092744,  0.01947528]], dtype=float32)]

So how do I get my 7 word vectors, for instance for {1: 'work'...}, out of this 8-row matrix, and what does that 8th row mean? If I change it to vocab_size = len(tokenizer.index_word) (without the + 1), I get shape errors when trying to fit the model.

CodePudding user response:

The Embedding layer uses tf.nn.embedding_lookup under the hood, which is zero-based by default. For example:

import tensorflow as tf
import numpy as np

docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work']
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(docs)
encoded_docs = tokenizer.texts_to_sequences(docs)
max_seq_len = max(len(x) for x in encoded_docs) # max len is 2
padded_seq = tf.keras.preprocessing.sequence.pad_sequences(sequences=encoded_docs,maxlen=max_seq_len,padding='post')
embedding_size = 8

tf.random.set_seed(111)

# Create integer embeddings for demonstration purposes.
embeddings = tf.cast(tf.random.uniform((7, embedding_size), minval=10,  maxval=20, dtype=tf.int32), dtype=tf.float32)

print(padded_seq)

[[2 3]
 [4 1]
 [5 6]
 [7 1]]

tf.nn.embedding_lookup(embeddings, padded_seq)
<tf.Tensor: shape=(4, 2, 8), dtype=float32, numpy=
array([[[17., 11., 10., 16., 17., 16., 16., 17.],
        [18., 15., 13., 13., 18., 18., 10., 16.]],

       [[17., 16., 13., 12., 13., 15., 19., 14.],
        [12., 15., 12., 15., 10., 19., 15., 12.]],

       [[18., 15., 11., 13., 13., 13., 16., 10.],
        [18., 18., 11., 12., 10., 13., 14., 10.]],

    --> [[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.] <--,
        [12., 15., 12., 15., 10., 19., 15., 12.]]], dtype=float32)>

Notice how the integer 7 is mapped to zeros, because tf.nn.embedding_lookup with a 7-row table only knows how to map the values 0 to 6. That is the reason you should use vocab_size = len(tokenizer.index_word) + 1, since you want a meaningful vector for the integer 7:

embeddings = tf.cast(tf.random.uniform((8, embedding_size), minval=10,  maxval=20, dtype=tf.int32), dtype=tf.float32)

tf.nn.embedding_lookup(embeddings, padded_seq)
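
With eight rows, every index produced by the Tokenizer (1 through 7) plus the padding index 0 now has its own row. A quick sanity check, as a small sketch continuing the snippet above:

looked_up = tf.nn.embedding_lookup(embeddings, padded_seq)

# padded_seq[3][0] is the integer 7 ('nice'); it now picks up the 8th row
# of the table instead of falling outside it.
print(tf.reduce_all(looked_up[3, 0] == embeddings[7]).numpy())  # True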

The index 0 is then effectively reserved for padding (or other special tokens), since your vocabulary starts from 1.
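
To map the trained vectors back to words: row i of the embedding weight matrix corresponds to the word with index i in tokenizer.index_word, and row 0 to the padding index. A minimal sketch, assuming the model from the question has been compiled and fitted:

weights = model.get_layer('embedding_lay').get_weights()[0]  # shape (vocab_size, 8)

# Row 0 is the padding row; rows 1..7 hold the vectors for the 7 words.
for idx, word in tokenizer.index_word.items():
    print(idx, word, weights[idx])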
