Modifying tokens generated through standard tokenizer

I am trying to understand how the standard tokenizer works. Below is the code in my tokenizer factory file:

package pl.allegro.tech.elasticsearch.index.analysis.pl;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractTokenizerFactory;

public class UrlTokenizerFactory extends AbstractTokenizerFactory {

    public UrlTokenizerFactory(IndexSettings indexSettings, Environment environment, String name, Settings settings) {
        super(indexSettings, name, settings);
    }

    @Override
    public Tokenizer create() {
        StandardTokenizer t = new StandardTokenizer();
        return t;
    }
}

I want to modify every token generated by the standard tokenizer. For example, just to test that I can modify the tokens, I want to append an "a" (or any other character) to every token. I tried concatenating "a" to the token in the return statement of the create() method above using the "+" operator, but it didn't work. Does anyone have an idea how to implement this?

CodePudding user response:

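You can use a pattern_replace char filter, either in an _analyze request or in a custom analyzer. The char filter rewrites the input text before it reaches the tokenizer, so _a is appended to every word and therefore to every token the standard tokenizer generates. You do not need to change any Java code.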

POST _analyze
{
  "text": [
    "Stack overflow"
  ],
  "tokenizer": "standard", 
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "(\\S+)",
      "replacement": "$0_a"
    }
  ]
}

Output:

{
  "tokens" : [
    {
      "token" : "Stack_a",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "overflow_a",
      "start_offset" : 6,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}
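
To see what the char filter does to the text before the tokenizer runs, the same replacement can be reproduced in plain Java, since pattern_replace is based on Java regular expressions. This is only a minimal sketch of the regex step, not of the Elasticsearch analysis chain; the class name PatternReplaceSketch and the helper appendSuffix are hypothetical names chosen for illustration.

```java
import java.util.regex.Pattern;

public class PatternReplaceSketch {

    // Same regex as in the char filter: one or more non-whitespace characters.
    private static final Pattern WORD = Pattern.compile("(\\S+)");

    // Hypothetical helper: appends "_a" to every whitespace-delimited word.
    // "$0" refers to the whole match, exactly as in the "replacement" setting.
    static String appendSuffix(String text) {
        return WORD.matcher(text).replaceAll("$0_a");
    }

    public static void main(String[] args) {
        // The standard tokenizer would then receive this rewritten text.
        System.out.println(appendSuffix("Stack overflow"));
    }
}
```

The standard tokenizer does not split on the underscore, which is why the rewritten words come out as the single tokens Stack_a and overflow_a in the output above.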