Here is the tokenizer -
"tokenizer": {
"filename" : {
"pattern" : "[^\\p{L}\\d] ",
"type" : "pattern"
}
},
Mapping -
"name": {
"type": "string",
"analyzer": "filename_index",
"include_in_all": true,
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
},
"lower_case_sort": {
"type": "string",
"analyzer": "naturalsort"
}
}
},
Analyzer -
"filename_index" : {
"tokenizer" : "filename",
"filter" : [
"word_delimiter",
"lowercase",
"russian_stop",
"russian_keywords",
"russian_stemmer",
"czech_stop",
"czech_keywords",
"czech_stemmer"
]
},
I would like to get index item by searching - mclaren, but the name indexed is McLaren. I would like to stick to query_string cause a lot of other functionality is based on that. Here is the query with what I cant get the expected result -
{
"query": {
"filtered": {
"query": {
"query_string" : {
"query" : "mclaren",
"default_operator" : "AND",
"analyze_wildcard" : true,
}
}
}
},
"size" :50,
"from" : 0,
"sort": {}
}
How I could accomplish this? Thank you!
CodePudding user response:
I got it ! The problem is certainly around the word_delimiter
token filter.
By default it :
Split tokens at letter case transitions. For example: PowerShot → Power, Shot
So macLaren generate two tokens -> [mac, Laren] when maclaren only generate one token ['maclaren'].
analyze example :
POST _analyze
{
"tokenizer": {
"pattern": """[^\p{L}\d] """,
"type": "pattern"
},
"filter": [
"word_delimiter"
],
"text": ["macLaren", "maclaren"]
}
Response:
{
"tokens" : [
{
"token" : "mac",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 0
},
{
"token" : "Laren",
"start_offset" : 3,
"end_offset" : 8,
"type" : "word",
"position" : 1
},
{
"token" : "maclaren",
"start_offset" : 9,
"end_offset" : 17,
"type" : "word",
"position" : 102
}
]
}
So I think one option is to configure your word_delimiter with the option split_on_case_change
to false (see parameters doc)
Ps: remeber to remove the settings you previously added (cf comments), since with this setting, your query string query will only target the name
field that does not exists.