How to tokenize German words using tf.keras.preprocessing.text.Tokenizer


Suppose we have a German sentence like this: "Ich hab gewonnen!". How can we tokenize the sentence into unique words, as we do with English sentences using text.Tokenizer? I'm not allowed to use other libraries like spaCy. Can anyone give ideas? I'm looking for something like an argument for Tokenizer.

CodePudding user response:

  1. Most tokenizers are generic algorithms; the language doesn't matter, so any example that works in English will work in German. The exceptions are languages written without spaces between words (Chinese, Japanese, Korean; SentencePiece works for these) and languages with long compound words, as German has (subword tokenizers get around this problem). For plain word-level tokenization, see the Tokenizer sketch at the end of this answer.

  2. tf.keras.preprocessing is deprecated. An all-around better tool is tf.keras.layers.TextVectorization. Maybe all you need is:

import tensorflow as tf

text_vec = tf.keras.layers.TextVectorization()
text_vec.adapt(german_data)  # german_data: any iterable or dataset of German strings
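
For example, here is a minimal sketch on the sentence from the question (the toy corpus, including the second sentence, is just illustrative filler data):

import tensorflow as tf

# Illustrative toy corpus; replace with your own German text data.
german_data = [
    "Ich hab gewonnen!",
    "Wir haben das Spiel verloren.",
]

text_vec = tf.keras.layers.TextVectorization()
text_vec.adapt(german_data)

# The default standardization lowercases and strips punctuation,
# so "gewonnen!" becomes the token "gewonnen".
print(text_vec.get_vocabulary())

# Each sentence maps to a sequence of integer token ids.
print(text_vec(["Ich hab gewonnen!"]))

One caveat: as far as I know, the default lowercasing only affects ASCII letters, so words with umlauts keep their case; if that matters to you, pass your own standardize callable to TextVectorization.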
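And to illustrate point 1: if you must stay with the legacy Tokenizer from the question, it needs no special argument for German. A minimal sketch (the outputs in the comments are what the defaults produce):

import tensorflow as tf

# The deprecated Tokenizer lowercases, strips punctuation, and splits
# on whitespace by default, which already works for German text.
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(["Ich hab gewonnen!"])

print(tokenizer.word_index)  # {'ich': 1, 'hab': 2, 'gewonnen': 3}
print(tokenizer.texts_to_sequences(["Ich hab gewonnen!"]))  # [[1, 2, 3]]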