Elasticsearch Completion Suggester - How to discard non-letter characters during indexing?

Here's my index:

PUT autocomplete-food
{
  "mappings": {
    "properties": {
      "suggest": {
        "type": "completion"
      }
    }
  }
}

Adding a document to this index:

PUT autocomplete-food/_doc/1?refresh
{
  "suggest": [
    {
      "input": "Starbucks",
      "weight": 10
    },
    {
      "input": [" (Coffee","Latte","Flat White"],
      "weight": 5
    }
  ]
}

Search query for suggestions:

POST autocomplete-food/_search?pretty
{
  "suggest": {
    "suggest": {
      "prefix": "coff",        
      "completion": {         
          "field": "suggest"
      }
    }
  }
}

Search result:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "suggest" : {
    "suggest" : [
      {
        "text" : "coff",
        "offset" : 0,
        "length" : 4,
        "options" : [
          {
            "text" : " (Coffee",
            "_index" : "autocomplete-food",
            "_type" : "_doc",
            "_id" : "1",
            "_score" : 5.0,
            "_source" : {
              "suggest" : [
                {
                  "input" : "Starbucks",
                  "weight" : 10
                },
                {
                  "input" : [
                    " (Coffee",
                    "Latte",
                    "Flat White"
                  ],
                  "weight" : 5
                }
              ]
            }
          }
        ]
      }
    ]
  }
}

Notice that the "text" value is " (Coffee". I don't want to index or return the non-letter characters. Since the default analyzer for a completion field is the simple analyzer, I expected them to be dropped, but the "input" value in the response still contains the special characters.

How can I discard the non-letter characters? P.S. I'm on Elasticsearch 7.17.

I also tried changing the analyzer from the default (simple) to standard, but it did not help.

CodePudding user response:

If you look at the output of the simple analyzer for the input " (Coffee", you get this:

POST _analyze
{
  "analyzer": "simple",
  "text": " (Coffee"
}

Results =>

{
  "tokens" : [
    {
      "token" : "coffee",
      "start_offset" : 2,
      "end_offset" : 8,
      "type" : "word",
      "position" : 0
    }
  ]
}

As you can see, the simple analyzer does not index the non-letter characters: the only token produced is coffee. That is why you can find the " (Coffee" suggestion by typing just coff. If the leading space and parenthesis had been indexed, that prefix would not match.
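
To make the difference between the indexed token and the stored input concrete, here is a rough Python approximation of what the simple analyzer does (split on non-letter characters, then lowercase). The function name simple_analyze is made up for this illustration, and the regex only approximates Lucene's notion of a letter; it is not how Elasticsearch is implemented.

```python
import re
from typing import List

# Runs of letters: word characters minus digits and the underscore.
# Assumption: this is only an approximation of the "simple" analyzer.
_LETTER_RUNS = re.compile(r"[^\W\d_]+")


def simple_analyze(text: str) -> List[str]:
    """Split on non-letter characters and lowercase each token."""
    return [token.lower() for token in _LETTER_RUNS.findall(text)]


stored_input = " (Coffee"
indexed_tokens = simple_analyze(stored_input)

print(indexed_tokens)       # what the prefix "coff" is matched against
print(repr(stored_input))   # what comes back as the suggestion text
```

The tokens are what the prefix is matched against, while the original string is what the suggester returns, so the two can differ.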

There may be a misconception here about how analyzers work: you cannot expect them to modify the content of your documents, so the _source and the returned suggestion text stay exactly as you sent them.

As for suggesters, whatever you provide as input is exactly what will be returned as the suggestion, so it is up to you to make the inputs look the way you want them returned. The analyzer will not make those modifications for you; it only indexes the resulting terms in the suggester's FST so that the inputs can be found.
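
Since the cleanup has to happen before the document reaches Elasticsearch, one option is to sanitize the inputs in the client. Below is a minimal sketch in Python, assuming you want to drop every non-letter character while keeping single spaces between words. The helper names clean_input and clean_suggest are hypothetical, not part of any Elasticsearch client.

```python
import re

# Any run of non-letter characters (punctuation, digits, underscores,
# whitespace) is collapsed into a single space.
_NON_LETTERS = re.compile(r"[\W\d_]+")


def clean_input(text):
    """Return text with non-letter characters removed and spaces normalized."""
    return _NON_LETTERS.sub(" ", text).strip()


def clean_suggest(entries):
    """Clean every "input" of a completion field value before indexing.

    Each entry is a dict with an "input" (string or list of strings) and
    optional extra keys such as "weight", which are kept unchanged.
    Inputs that become empty after cleaning are dropped.
    """
    cleaned = []
    for entry in entries:
        raw = entry["input"]
        inputs = [raw] if isinstance(raw, str) else raw
        kept = [c for c in (clean_input(i) for i in inputs) if c]
        if kept:
            cleaned.append({**entry, "input": kept})
    return cleaned


suggest = [
    {"input": "Starbucks", "weight": 10},
    {"input": [" (Coffee", "Latte", "Flat White"], "weight": 5},
]
print(clean_suggest(suggest))
```

The cleaned list can then be sent as the "suggest" field in the same PUT request as before, and the suggester will return "Coffee" instead of " (Coffee".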