Extend Elasticsearch's standard Analyzer with additional characters to tokenize on

I basically want the functionality of the built-in standard analyzer, plus tokenization on underscores.

Currently the standard analyzer keeps brown_fox_has as a single token, but I want [brown, fox, has] instead. The simple analyzer loses some functionality compared with the standard one, so I want to stay as close to the standard analyzer as possible.

The docs only show how to add filters and make other non-tokenizer changes, but I want to keep all of the standard tokenizer's behavior while also splitting on underscores.

I could create a character filter to map _ to - and the standard tokenizer will do the job for me, but is there a better way?
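To make that workaround concrete, here is a rough pure-Python approximation of what a mapping character filter does before the tokenizer sees the text. The helper names and the regex split are illustrative assumptions, not Elasticsearch's implementation; the real standard tokenizer follows Unicode text segmentation rules and handles many more cases.

```python
import re

# Hypothetical stand-in for a "mapping" char filter: rewrite characters
# before tokenization. Here every "_" becomes a space.
UNDERSCORE_TO_SPACE = str.maketrans({"_": " "})


def apply_char_filter(text):
    """Replace underscores with spaces, as the char filter would."""
    return text.translate(UNDERSCORE_TO_SPACE)


def rough_tokenize(text):
    """Crude word splitter: \\w matches letters, digits and "_",
    so an underscore does not break a token on its own."""
    return re.findall(r"\w+", text.lower())


text = "the quick brown_fox_has to be split"
without_filter = rough_tokenize(text)
with_filter = rough_tokenize(apply_char_filter(text))

print(without_filter)  # ['the', 'quick', 'brown_fox_has', 'to', 'be', 'split']
print(with_filter)     # ['the', 'quick', 'brown', 'fox', 'has', 'to', 'be', 'split']
```

The point of the sketch is only that the tokenizer itself stays unchanged; the text is rewritten before it arrives there.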

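            from elasticsearch import Elasticsearch

            es = Elasticsearch("http://localhost:9200")  # assumed local cluster; adjust as needed
            es.indices.create(index="mine", body={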
                "settings": {
                    "analysis": {
                        "analyzer": {
                            "default": {
                                "type": "custom",
                                # "tokenize_on_chars": ["_"],  # i want this to work with the standard tokenizer without using char group
                                "tokenizer": "standard",
                                "filter": ["lowercase"]
                            }
                        }
                    },
                }
            })
            res = es.indices.analyze(index="mine", body={
                "field": "text",
                "text": "the quick brown_fox_has to be split"
            })

CodePudding user response：

Use a mapping character filter and define it along with the standard tokenizer:

POST /_analyze
{
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        "_ =>\\u0020" // replace underscore with a space
      ]
    }
  ],
  "tokenizer": "standard",
  "text": "the quick brown_fox_has to be split"
}
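
To carry this over to the index from the question, the character filter can be registered under `analysis.char_filter` and referenced from the custom analyzer. The sketch below only builds the settings body. The filter name `underscore_to_space` and the helper `build_settings` are made-up names for illustration, and the commented-out calls mirror the `body=` style used in the question (newer client versions may prefer keyword arguments).

```python
def build_settings():
    """Index settings: standard tokenizer, with "_" mapped to a space first."""
    return {
        "settings": {
            "analysis": {
                "char_filter": {
                    # Hypothetical filter name; "mapping" is the built-in type.
                    "underscore_to_space": {
                        "type": "mapping",
                        # Same rule as in the _analyze request above;
                        # \u0020 is the escape for a space character.
                        "mappings": ["_ =>\\u0020"],
                    }
                },
                "analyzer": {
                    "default": {
                        "type": "custom",
                        "char_filter": ["underscore_to_space"],
                        "tokenizer": "standard",
                        "filter": ["lowercase"],
                    }
                },
            }
        }
    }


body = build_settings()

# With a client like the one in the question (assumed to be named `es`):
# es.indices.create(index="mine", body=body)
# res = es.indices.analyze(index="mine", body={
#     "text": "the quick brown_fox_has to be split"
# })
```

Because the character filter runs before the tokenizer, the rest of the standard analysis chain stays as it was in the question.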