How to get an index item that has "name": "McLaren" by searching with "mcla

Here is the tokenizer -

"tokenizer": {
   "filename" : {
      "pattern" : "[^\\p{L}\\d]+",
      "type" : "pattern"
   }
},

Mapping -

"name": {
      "type": "string",
      "analyzer": "filename_index",
      "include_in_all": true,
      "fields": {
        "raw": {
          "type": "string",
          "index": "not_analyzed"
        },
        "lower_case_sort": {
          "type": "string",
          "analyzer": "naturalsort"
        }
      }
    },

Analyzer -

"filename_index" : {
         "tokenizer" : "filename",
         "filter" : [
          "word_delimiter", 
          "lowercase",
          "russian_stop", 
          "russian_keywords", 
          "russian_stemmer",
          "czech_stop",
          "czech_keywords",
          "czech_stemmer"
        ]
      },

I would like to find this item by searching for mclaren, although the indexed name is McLaren. I would like to stick with query_string because a lot of other functionality is based on it. Here is the query that does not return the expected result -

{
"query": {
    "filtered": {
        "query": {
            "query_string" : {
                "query" : "mclaren",
                "default_operator" : "AND",
                "analyze_wildcard" : true
            }
        }
    }
},
"size" :50,
"from" : 0,
"sort": {}
}

How could I accomplish this? Thank you!

CodePudding user response：

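I got it! The problem is almost certainly the word_delimiter token filter. By default it does the following:

Split tokens at letter case transitions. For example: PowerShot → Power, Shot

(See the word_delimiter token filter documentation.)

So macLaren generates two tokens, [mac, Laren], whereas maclaren generates only one, [maclaren]. In the same way, McLaren is split into Mc and Laren at index time, so the single query token mclaren does not match the indexed tokens.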

Analyze example:

POST _analyze
{
  "tokenizer": {
    "pattern": """[^\p{L}\d]+""",
    "type": "pattern"
  },
  "filter": [
    "word_delimiter"
  ],
  "text": ["macLaren", "maclaren"]
}

Response:

{
  "tokens" : [
    {
      "token" : "mac",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "Laren",
      "start_offset" : 3,
      "end_offset" : 8,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "maclaren",
      "start_offset" : 9,
      "end_offset" : 17,
      "type" : "word",
      "position" : 102
    }
  ]
}

So I think one option is to configure your word_delimiter filter with the split_on_case_change option set to false (see the filter's parameters documentation).
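
A minimal sketch of what that could look like in the index analysis settings, assuming the rest of your configuration stays as posted. The filter name word_delimiter_no_case_split is a hypothetical name chosen for illustration. The russian_* and czech_* filters are assumed to be defined elsewhere in your settings, as are the filename tokenizer and the other analyzers, so this fragment would need to be merged into the existing "analysis" section rather than used on its own:

```json
{
  "analysis": {
    "filter": {
      "word_delimiter_no_case_split": {
        "type": "word_delimiter",
        "split_on_case_change": false
      }
    },
    "analyzer": {
      "filename_index": {
        "tokenizer": "filename",
        "filter": [
          "word_delimiter_no_case_split",
          "lowercase",
          "russian_stop",
          "russian_keywords",
          "russian_stemmer",
          "czech_stop",
          "czech_keywords",
          "czech_stemmer"
        ]
      }
    }
  }
}
```

With case-change splitting disabled, McLaren should remain a single token that the lowercase filter turns into mclaren, matching the query term. Documents that were indexed with the old analyzer would need to be reindexed for the change to take effect.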

PS: remember to remove the settings you added earlier (see the comments), since with them your query_string query will only target the name field that does not exist.