I created an index called my_index with this command:
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "synonym": {
          "type": "synonym",
          "lenient": "true",
          "synonyms": [
            ...
            ...
            ...
          ]
        }
      },
      "analyzer": {
        "synonym": {
          "filter": [
            "uppercase",
            "synonym"
          ],
          "tokenizer": "whitespace"
        }
      }
    }
  },
  "mappings": {
    "items": {
      "properties": {
        "country": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "information": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          },
          "analyzer": "synonym"
        },
        "person": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}
Inside the information field, I had data that looks like 100 /INDIA/2022 (note the space after 100). If I search for 100/INDIA/2022 (no space after 100), Elasticsearch returns nothing. If I create a new index with no analyzer, searching for 100/INDIA/2022 returns the expected result. Can someone help me with this problem?
CodePudding user response:
The synonym analyzer defined in your index settings tokenizes the text on whitespace. So, analyzing the text 100 /INDIA/2022
GET 71595890/_analyze
{
  "text": "100 /INDIA/2022",
  "analyzer": "synonym"
}
produces the following tokens:
{
  "tokens" : [
    {
      "token" : "100",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/INDIA/2022",
      "start_offset" : 4,
      "end_offset" : 15,
      "type" : "word",
      "position" : 1
    }
  ]
}
Since you have not explicitly defined any search_analyzer, the search analyzer defaults to the index analyzer (the analyzer you have defined in your index mapping).
So, when you search for 100/INDIA/2022, the text gets tokenized into
{
  "tokens" : [
    {
      "token" : "100/INDIA/2022",
      "start_offset" : 0,
      "end_offset" : 14,
      "type" : "word",
      "position" : 0
    }
  ]
}
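You can reproduce the output above with the same _analyze API, this time passing the query string (same index and analyzer as before):

```json
GET 71595890/_analyze
{
  "text": "100/INDIA/2022",
  "analyzer": "synonym"
}
```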
The single token 100/INDIA/2022 matches neither 100 nor /INDIA/2022, therefore no documents match.
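You can see the mismatch in plain Python with a rough simulation of the analysis chain (whitespace tokenizer plus the uppercase filter; the synonym filter is ignored here since the synonyms are not shown):

```python
def whitespace_uppercase_tokens(text):
    # whitespace tokenizer: split on whitespace; then uppercase token filter
    return [t.upper() for t in text.split()]

indexed_tokens = whitespace_uppercase_tokens("100 /INDIA/2022")
query_tokens = whitespace_uppercase_tokens("100/INDIA/2022")

print(indexed_tokens)  # ['100', '/INDIA/2022']
print(query_tokens)    # ['100/INDIA/2022']
print(set(indexed_tokens) & set(query_tokens))  # set() -> no shared token, no match
```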
In the second case, when you created a new index with no analyzer, the standard analyzer is used by default. The standard analyzer produces the following tokens:
{
  "tokens" : [
    {
      "token" : "100",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<NUM>",
      "position" : 0
    },
    {
      "token" : "india",
      "start_offset" : 5,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "2022",
      "start_offset" : 11,
      "end_offset" : 15,
      "type" : "<NUM>",
      "position" : 2
    }
  ]
}
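This behaviour can also be approximated in plain Python (a rough sketch, not the real Lucene implementation: split on non-alphanumeric characters, then lowercase):

```python
import re

def standard_like_tokens(text):
    # rough approximation of the standard analyzer:
    # break on non-alphanumeric characters, lowercase each token
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

print(standard_like_tokens("100 /INDIA/2022"))  # ['100', 'india', '2022']
print(standard_like_tokens("100/INDIA/2022"))   # ['100', 'india', '2022']
```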
The tokens produced for 100 /INDIA/2022 and 100/INDIA/2022 by the standard analyzer are the same (only the offsets differ), which is why the index with no analyzer returns the expected result.
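If you want to keep the synonym analyzer and still have both forms match, one option (a sketch, not part of the original setup) is to add a pattern_replace character filter that turns / into a space before the whitespace tokenizer runs, so both inputs analyze to the tokens 100, INDIA, 2022:

```json
"analysis": {
  "char_filter": {
    "slash_to_space": {
      "type": "pattern_replace",
      "pattern": "/",
      "replacement": " "
    }
  },
  "analyzer": {
    "synonym": {
      "char_filter": [ "slash_to_space" ],
      "tokenizer": "whitespace",
      "filter": [ "uppercase", "synonym" ]
    }
  }
}
```

Note that changing an analyzer requires reindexing existing documents for the change to take effect.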