I created an index called my_index with this command:
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "synonym": {
          "type": "synonym",
          "lenient": "true",
          "synonyms": [
            ...
            ...
            ...
          ]
        }
      },
      "analyzer": {
        "synonym": {
          "filter": [
            "uppercase",
            "synonym"
          ],
          "tokenizer": "whitespace"
        }
      }
    }
  },
  "mappings": {
    "items": {
      "properties": {
        "country": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "information": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          },
          "analyzer": "synonym"
        },
        "person": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}
Inside the information field, I had data that looks like 100 /INDIA/2022 (note the space after 100). If I search for 100/INDIA/2022 (no space after 100), Elasticsearch returns nothing. If I create a new index with no analyzer, searching for 100/INDIA/2022 returns the expected result. Can someone help me with this problem?
CodePudding user response:
The synonym analyzer defined in your index settings tokenizes the text on whitespace. So, analyzing the text 100 /INDIA/2022
GET 71595890/_analyze
{
  "text": "100 /INDIA/2022",
  "analyzer": "synonym"
}
produces the following tokens:
{
  "tokens" : [
    {
      "token" : "100",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/INDIA/2022",
      "start_offset" : 4,
      "end_offset" : 15,
      "type" : "word",
      "position" : 1
    }
  ]
}
Since you have not explicitly defined any search_analyzer, the search analyzer defaults to the index analyzer (the analyzer you have defined in your index mapping).
So, when you search for 100/INDIA/2022, the text gets tokenized into
{
  "tokens" : [
    {
      "token" : "100/INDIA/2022",
      "start_offset" : 0,
      "end_offset" : 14,
      "type" : "word",
      "position" : 0
    }
  ]
}
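You can reproduce the output above with the same _analyze API, this time passing the query string (same index and analyzer as before):

```json
GET 71595890/_analyze
{
  "text": "100/INDIA/2022",
  "analyzer": "synonym"
}
```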
The single token 100/INDIA/2022 matches neither 100 nor /INDIA/2022, therefore no documents match.
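You can see the mismatch in plain Python with a rough simulation of the analysis chain (whitespace tokenizer plus the uppercase filter; the synonym filter is ignored here since the synonyms are not shown):

```python
def whitespace_uppercase_tokens(text):
    # whitespace tokenizer: split on whitespace; then uppercase token filter
    return [t.upper() for t in text.split()]

indexed_tokens = whitespace_uppercase_tokens("100 /INDIA/2022")
query_tokens = whitespace_uppercase_tokens("100/INDIA/2022")

print(indexed_tokens)  # ['100', '/INDIA/2022']
print(query_tokens)    # ['100/INDIA/2022']
print(set(indexed_tokens) & set(query_tokens))  # set() -> no shared token, no match
```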
In the second case, when you created a new index with no analyzer, the standard analyzer is used by default. The standard analyzer produces the following tokens:
{
  "tokens" : [
    {
      "token" : "100",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<NUM>",
      "position" : 0
    },
    {
      "token" : "india",
      "start_offset" : 5,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "2022",
      "start_offset" : 11,
      "end_offset" : 15,
      "type" : "<NUM>",
      "position" : 2
    }
  ]
}
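This behaviour can also be approximated in plain Python (a rough sketch, not the real Lucene implementation: split on non-alphanumeric characters, then lowercase):

```python
import re

def standard_like_tokens(text):
    # rough approximation of the standard analyzer:
    # break on non-alphanumeric characters, lowercase each token
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

print(standard_like_tokens("100 /INDIA/2022"))  # ['100', 'india', '2022']
print(standard_like_tokens("100/INDIA/2022"))   # ['100', 'india', '2022']
```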
The tokens produced for 100 /INDIA/2022 and 100/INDIA/2022 by the standard analyzer are the same (only the offsets differ), which is why the index with no analyzer returns the expected result.
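If you want to keep the synonym analyzer and still have both forms match, one option (a sketch, not part of the original setup) is to add a pattern_replace character filter that turns / into a space before the whitespace tokenizer runs, so both inputs analyze to the tokens 100, INDIA, 2022:

```json
"analysis": {
  "char_filter": {
    "slash_to_space": {
      "type": "pattern_replace",
      "pattern": "/",
      "replacement": " "
    }
  },
  "analyzer": {
    "synonym": {
      "char_filter": [ "slash_to_space" ],
      "tokenizer": "whitespace",
      "filter": [ "uppercase", "synonym" ]
    }
  }
}
```

Note that changing an analyzer requires reindexing existing documents for the change to take effect.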