I have an index containing city names. I am trying to score my entries correctly, but I do not get the desired results. I have tried creating the index without any settings, with an edge-n-gram analyzer, and with an n-gram analyzer. The city names are German, and I read that this should be a suitable analyzer. Here are the settings I tried for the analyzers:
{
  "settings": {
    "index": {
      "number_of_shards": "1",
      "number_of_replicas": "1"
    },
    "analysis": {
      "analyzer": {
        "e_ngram_token": {
          "tokenizer": "edge_ngram_tokenizer"
        }
      },
      "tokenizer": {
        "edge_ngram_tokenizer": {
          "type": "edge_ngram", // switched to "ngram" in the other attempt
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}
Here is some example data for a bulk creation (`/cities/_bulk`):
{ "create": { } }
{"name": "Münster"}
{ "create": { } }
{"name": "München"}
{ "create": { } }
{"name": "Bad-Münster Fake 2"}
{ "create": { } }
{"name": "Bad Münster Fake"}
{ "create": { } }
{"name": "Munddort fake"}
{ "create": { } }
{"name": "Stolpmünde"}
{ "create": { } }
{"name": "Swinemünde"}
{ "create": { } }
{"name": "Dortmund"}
{ "create": { } }
{"name": "Müden (Mosel)"}
{ "create": { } }
{"name": "Mannheim"}
{ "create": { } }
{"name": "Marburg"}
{ "create": { } }
{"name": "Magdeburg"}
{ "create": { } }
{"name": "Montreux"}
{ "create": { } }
{"name": "Sankt Moritz"}
So when I run a query like this:
{
  "from": 0,
  "size": 100,
  "query": {
    "match": {
      "name": {
        "query": "mun",
        "analyzer": "e_ngram_token",
        "fuzziness": "2",
        "fuzzy_transpositions": true,
        "operator": "or",
        "max_expansions": 50,
        "boost": 5
      }
    }
  }
}
I would expect to get cities like "München", "Münster", and so on: basically every city containing "mun" or, because of the fuzziness, cities containing "mün", "man", "tan", etc. What I get instead is this:
{
  "took": 10,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.0,
    "hits": [
      {
        "_index": "cities",
        "_type": "_doc",
        "_id": "7jX2ioQBc3BSm-EXMB2V",
        "_score": 0.0,
        "_source": {
          "name": "Bad-Münster Fake 2"
        }
      }
    ]
  }
}
Can somebody explain what I am missing? In my understanding, the tokens are created at index time and should look something like `["Mü", "ün", "nc", ..., "Mün"]` for "München". Because I request a fuzziness of 2, the term "mun" should match the token "mün" and thus return the result.
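As a side note, one way to check which tokens the analyzer actually produces is the `_analyze` API (a minimal sketch, assuming the index and analyzer names from above):

```json
GET /cities/_analyze
{
  "analyzer": "e_ngram_token",
  "text": "München"
}
```

The response lists the edge n-grams generated for "München", which can then be compared against the query term.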
Thanks a lot!
CodePudding user response:
You must add the analyzer to the field mapping:
"name": {
  "type": "text",
  "analyzer": "e_ngram_token",   <-- add this
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    }
  }
}
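Without an `analyzer` in the mapping, the `name` field is indexed with the default `standard` analyzer, so no n-gram tokens exist in the index; the `analyzer` parameter in the match query only changes how the query text is analyzed. Since analyzers are applied at index time, the index has to be recreated and the documents reindexed. A minimal sketch of the full index creation (the `lowercase` filter is an assumption added here so that a query like "mun" matches tokens from "Mün..." case-insensitively):

```json
PUT /cities
{
  "settings": {
    "analysis": {
      "analyzer": {
        "e_ngram_token": {
          "tokenizer": "edge_ngram_tokenizer",
          "filter": [ "lowercase" ]
        }
      },
      "tokenizer": {
        "edge_ngram_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": [ "letter", "digit" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "e_ngram_token",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}
```

After recreating the index this way, rerun the bulk request from the question to reindex the example data.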