tensorflow 2 TextVectorization process tensor and dataset error


I would like to process text with TensorFlow 2.8 in a Jupyter notebook.

My code:

import re
import string
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow_text as tf_text

def standardize(input_data):
    # lowercase, strip punctuation, then tokenize on whitespace
    lowercase_str = tf.strings.lower(input_data)
    a_str = tf.strings.regex_replace(lowercase_str, f"[{re.escape(string.punctuation)}]", "")
    tokenizer = tf_text.WhitespaceTokenizer()
    tokens = tokenizer.tokenize(a_str)
    return tokens

# The input data is loaded from text files with TFRecordDataset(file_paths, "GZIP").
# Each file can be ~200 MB, about 300 files in total.
# Each file holds data with multiple columns; some columns are text.
# After loading, the dataset is accessed by column name,
# e.g. one column is "sports", so input_dataset["sports"] returns a string tensor.
# A rough sketch of that loading step follows; the simplified example tensor
# used in the rest of this question comes right after it.
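# Hypothetical loading sketch -- the file pattern, feature name, and schema
# below are assumptions for illustration, not the real data's schema:
#
#   file_paths = tf.io.gfile.glob("path/to/*.tfrecord.gz")
#   feature_spec = {"sports": tf.io.FixedLenFeature([], tf.string)}
#   input_dataset = (tf.data.TFRecordDataset(file_paths, compression_type="GZIP")
#                    .map(lambda s: tf.io.parse_single_example(s, feature_spec)))

# Simplified example tensor used in the rest of this question: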
my_data_tensor = tf.constant([["SWIM 2008-07 Baseball"], ["Football"]])

tf.print(my_data_tensor)
tf.print(my_data_tensor.shape)
tf.print(f"type is {type(my_data_tensor)}")
text_layer = layers.TextVectorization(
                        standardize = standardize,
                        max_tokens = 10,
                        output_mode = 'int',
                        output_sequence_length=10
                       )

my_dataset = tf.data.Dataset.from_tensor_slices(my_data_tensor)
text_layer.adapt(my_dataset.batch(2)) # error         
processed_text = text_layer(my_dataset)

error:
 ValueError: Exception encountered when calling layer "query_tower" (type QueryTower).
 When using `TextVectorization` to tokenize strings, the input rank must be 1 or the last shape dimension must be 1. Received: inputs.shape=(2, 1, None) with rank=3

I have tried tf.unstack(), tf.reshape(), and tf.unbatch, but none of them worked. For the given example:

[["SWIM 2008-07 Baseball"], ["Football"]]

what I need:

[["swim 200807 baseball"], ["football"]]
which will then be encoded as integers by the "text_layer".

These data (batch_size=2) will be used as features for a machine learning model.

Did I do something wrong? Thanks.

CodePudding user response:
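The error most likely comes from the custom standardize callable. When
standardize is a callable, TextVectorization expects it to return cleaned
strings with the same shape as its input; tokenization is handled by the
layer's split step (the default split is "whitespace"). Tokenizing inside
standardize turns the (2, 1) string tensor into a rank-3 ragged tensor of
shape (2, 1, None), which is exactly the shape the error message reports.

A minimal sketch that keeps the cleaning logic but leaves the whitespace
tokenization to the layer (other settings as in the question):

import re
import string
import tensorflow as tf
from tensorflow.keras import layers

# standardize only cleans the strings and returns the same shape;
# the layer's split step then tokenizes the cleaned strings on whitespace
def standardize(input_data):
    lowercase_str = tf.strings.lower(input_data)
    return tf.strings.regex_replace(
        lowercase_str, f"[{re.escape(string.punctuation)}]", "")

text_layer = layers.TextVectorization(
    standardize=standardize,
    split="whitespace",            # default value, shown for clarity
    max_tokens=10,
    output_mode="int",
    output_sequence_length=10,
)

my_data_tensor = tf.constant([["SWIM 2008-07 Baseball"], ["Football"]])
my_dataset = tf.data.Dataset.from_tensor_slices(my_data_tensor)

text_layer.adapt(my_dataset.batch(2))            # learns the vocabulary
processed = my_dataset.batch(2).map(text_layer)  # map the layer over batches
tf.print(next(iter(processed)))                  # int-encoded sequences

With this, "SWIM 2008-07 Baseball" becomes "swim 200807 baseball" before being
tokenized and encoded to integers, which matches the expected output above.
Note also that the original snippet calls the layer directly on a
tf.data.Dataset (text_layer(my_dataset)); a Keras layer expects tensors, so
here the dataset is mapped through the layer instead.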
