I am using tensorflow.keras to predict what an email is about from the email's sender, subject, and content.
# Use tokenizers to convert the email data into model input sequences
subject_sequence = subject_tk.texts_to_sequences(subject_series)
subject_sequence = sequence.pad_sequences(subject_sequence, maxlen = subject_length)
sender_sequence = subject_tk.texts_to_sequences(sender_series)
sender_sequence = sequence.pad_sequences(sender_sequence, maxlen = sender_length)
body_sequence = body_tk.texts_to_sequences(body_series)
body_sequence = sequence.pad_sequences(body_sequence, maxlen = body_length)
# Run the classification model on the input sequences and print the prediction
predication = email_classification_model.predict([subject_sequence, sender_sequence , body_sequence])
print(predication)
However, I noticed that sometimes (roughly 10% of the time) the prediction fails with the following error:
File "mailMonitory.py", line 102, in OnItemAdd
predication = email_classification_model.predict([subject_sequence, sender_sequence , body_sequence])
....
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[0,253] = 3686 is not in [0, 1897)
[[node model/embedding_1/embedding_lookup (defined at mailMonitory.py:102) ]] [Op:__inference_predict_function_4617]
print(sender_sequence)
[[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
693 3686 139 169]]
From my testing, the problem is always that the tokenizer converts the sender email into a sequence containing an index that is out of bounds for the model. Why is this happening? Does my tokenizer not contain enough data, or is there something wrong with my model? How can I fix this?
Answer:
You usually get this error when you feed integer values to your Embedding layer that are beyond the size of the layer's defined input_dim. In the example below, the first sequence works because all of its values are < input_dim, while the second sequence throws an exception because almost all of its values fall outside that range:
import tensorflow as tf

# Toy model: the Embedding layer only accepts indices in [0, input_dim) = [0, 10)
input = tf.keras.layers.Input(shape=(5,))
output = tf.keras.layers.Embedding(input_dim=10, output_dim=5)(input)
model = tf.keras.models.Model(input, output)

# All indices are < 10, so the lookup succeeds
print(model(tf.constant([1, 5, 2, 6, 8])))
# 12, 18, 19, 10 and 4000 are all >= 10, so the lookup raises InvalidArgumentError
print(model(tf.constant([1, 12, 18, 19, 10, 4000])))
tf.Tensor(
[[-0.03517901 0.01769676 0.01823583 0.01846877 -0.01214858]
[-0.04662237 -0.01376029 0.04361605 0.0426343 -0.01796628]
[ 0.020581 0.02564194 0.00014243 0.03558977 0.01154976]
[-0.01251727 0.00095896 0.00218729 -0.01606169 0.02248188]
[ 0.03368715 0.01532438 -0.01821761 0.00139984 0.00360139]], shape=(5, 5), dtype=float32)
---------------------------------------------------------------------------
InvalidArgumentError Traceback (most recent call last)
<ipython-input-3-cd27383a1b70> in <module>()
6
7 print(model(tf.constant([1, 5, 2, 6, 8])))
----> 8 print(model(tf.constant([1, 12, 18, 19, 10, 4000])))
...
InvalidArgumentError: Exception encountered when calling layer "embedding_1" (type Embedding).
indices[1] = 12 is not in [0, 10) [Op:ResourceGather]
Call arguments received:
• inputs=tf.Tensor(shape=(6,), dtype=float32)
So, the solution is to make sure you are using the correct size for the input_dim parameter: it must be large enough to cover every index your tokenizer can produce (typically len(tokenizer.word_index) + 1, since index 0 is reserved for padding).
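One way to keep things in sync is to derive input_dim from the fitted tokenizer and cap the tokenizer's output with num_words plus an oov_token, so prediction-time text can never produce an index the Embedding layer cannot look up. This is only a minimal sketch built around the sender input: sender_tk mirrors the name from your question, while the training texts, MAX_WORDS, sender_length and output_dim are made-up placeholder values:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Hypothetical training texts for the sender input (stand-ins for your real data)
sender_train_texts = ["alice@example.com", "bob@example.com"]

# num_words caps the indices texts_to_sequences can ever emit, and oov_token maps
# unseen words to index 1 instead of dropping them
MAX_WORDS = 2000
sender_tk = Tokenizer(num_words=MAX_WORDS, oov_token="<OOV>")
sender_tk.fit_on_texts(sender_train_texts)

# input_dim must cover every index the tokenizer can produce; index 0 is reserved
# for padding, so the safe size is vocabulary size + 1, capped by num_words
sender_vocab_size = min(len(sender_tk.word_index) + 1, MAX_WORDS)

sender_length = 256  # assumed padding length
sender_input = tf.keras.layers.Input(shape=(sender_length,))
sender_embedded = tf.keras.layers.Embedding(input_dim=sender_vocab_size,
                                            output_dim=16)(sender_input)

# At prediction time, encode with the SAME fitted tokenizer the model was built against
new_sequences = sender_tk.texts_to_sequences(["carol@example.com"])
new_sequences = pad_sequences(new_sequences, maxlen=sender_length)
The same pairing applies to the subject and body inputs: whichever tokenizer produces a sequence at prediction time must be the one whose vocabulary size was used as input_dim when that branch of the model was built.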