Suppose we have a German sentence like this: "Ich hab gewonnen!". How can we tokenize the sentence into unique words, the way we do with English sentences using text.Tokenizer? I can't use other libraries like spaCy. Can anyone give ideas? I'm looking for something like an argument for Tokenizer.
CodePudding user response:
Most tokenizers are generic algorithms; the language doesn't matter, so any example that works in English will work in German. The exceptions are languages written without spaces (Chinese, Japanese, Korean; SentencePiece works for these) and languages with very long compound words, which subword tokenizers get around.
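For example, the text.Tokenizer you mention already handles German out of the box, since by default it lowercases, strips punctuation and splits on whitespace. A minimal sketch (the sample sentences are only illustrative):

import tensorflow as tf

# Default filters strip punctuation such as "!", default split is on whitespace,
# so German words get indexed exactly like English ones.
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(["Ich hab gewonnen!", "Ich habe es gesehen."])

print(tokenizer.word_index)
# e.g. {'ich': 1, 'hab': 2, 'gewonnen': 3, 'habe': 4, 'es': 5, 'gesehen': 6}
print(tokenizer.texts_to_sequences(["Ich hab gewonnen!"]))
# e.g. [[1, 2, 3]]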
That said, tf.keras.preprocessing is deprecated. An all-around better tool is tf.keras.layers.TextVectorization. Maybe all you need is:
import tensorflow as tf

text_vec = tf.keras.layers.TextVectorization()
text_vec.adapt(german_data)  # german_data: a list or tf.data.Dataset of German strings
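A slightly fuller sketch of what that gives you (german_data here is just a placeholder corpus; substitute your own sentences):

import tensorflow as tf

# Placeholder corpus standing in for your German data.
german_data = ["Ich hab gewonnen!", "Ich habe es gesehen.", "Wir haben gewonnen."]

# Defaults: lowercase, strip punctuation, split on whitespace,
# and map each unique word to an integer id.
text_vec = tf.keras.layers.TextVectorization()
text_vec.adapt(german_data)

print(text_vec.get_vocabulary())
# ['', '[UNK]', ...unique words ordered by frequency...]
print(text_vec(tf.constant(["Ich hab gewonnen!"])))
# integer ids for "ich", "hab", "gewonnen"

A nice side effect is that text_vec is a regular Keras layer, so you can put it directly at the front of a model that takes raw strings as input, which the old Tokenizer can't do.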