How to query this string in elasticsearch : B:EGXXXXXX:PTP:MM_LMDM_DISP_AL


I have a string with colons in it, for example: B:EGXXXXXX:PTP:MM_LMDM_DISP_AL. The expectation is that when I use

GET index_name/_search
{
  "size": 10, 
  "query": {
    "query_string": {
      "query": "B\\:EGXXXXXX"
    }
  }
}

I get the whole string B:EGXXXXXX:PTP:MM_LMDM_DISP_AL back. But the above query returns no results. I can achieve this with a wildcard query, but I am looking for a way to do it without wildcards. My mapping for this is:

PUT /index_name?pretty
{
    "settings" : {
        "number_of_shards" : 2,
        "number_of_replicas" : 1
    },
    "mappings" : {
        "properties" : {
            "tags" : { "type" : "text" }
        }
    }
}

I add data using

PUT index_name/_doc/1
{
  "tags": [
    "B:EGXXXXXX:PTP:MM_LMDM_DISP_AL"
  ]
}

Answer:

Checking how Elasticsearch analyzes your text (request, then response) reveals:

POST _analyze
{
  "analyzer": "standard",
  "text": "B:EGXXXXXX:PTP:MM_LMDM_DISP_AL"
}
{
  "tokens" : [
    {
      "token" : "b:egxxxxxx:ptp:mm_lmdm_disp_al",
      "start_offset" : 0,
      "end_offset" : 30,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}

As you can see, the string is not split at all, because the field uses the default "standard" analyzer, which keeps this whole value as a single token. Searching for only part of the string therefore produces no results, which is exactly what you observed.
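
Note that querying for the complete value does match, because it analyzes to the same single token that was indexed; each colon just has to be escaped in query_string syntax:

GET index_name/_search
{
  "query": {
    "query_string": {
      "query": "B\\:EGXXXXXX\\:PTP\\:MM_LMDM_DISP_AL"
    }
  }
}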

To get the desired result, there are a couple of options, all of which involve changing the way the field is analyzed.

  1. Use edge n-grams: these let the user type just a few characters and still get matches (see the Edge Ngram Tokenizer documentation). A sketch appears at the end of this answer.

  2. Use a completion suggester: as long as you only need prefix lookups on the string and no other complex queries at the same time, this is roughly an order of magnitude faster than edge n-grams and easier to configure (see the Completion Suggester documentation). A sketch also appears at the end of this answer.

  3. Use a different analyzer, specifically one that splits on punctuation, such as the built-in "simple" analyzer. You can see how it splits your text:
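
(The request below is the same _analyze call as before with only the analyzer swapped; its response follows.)

POST _analyze
{
  "analyzer": "simple",
  "text": "B:EGXXXXXX:PTP:MM_LMDM_DISP_AL"
}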

{
  "tokens" : [
    {
      "token" : "b",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "egxxxxxx",
      "start_offset" : 2,
      "end_offset" : 10,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "ptp",
      "start_offset" : 11,
      "end_offset" : 14,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "mm",
      "start_offset" : 15,
      "end_offset" : 17,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "lmdm",
      "start_offset" : 18,
      "end_offset" : 22,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "disp",
      "start_offset" : 23,
      "end_offset" : 27,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "al",
      "start_offset" : 28,
      "end_offset" : 30,
      "type" : "word",
      "position" : 6
    }
  ]
}

As you can see, it splits on the colons and underscores, which is likely how you want users to search. Reconfigure your mapping to specify the "simple" analyzer for that particular field.
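
A minimal sketch of the reconfigured mapping, assuming a fresh index; an analyzer cannot be changed in place on an existing field, so index_name_v2 below is a hypothetical new index you would reindex your data into:

PUT /index_name_v2?pretty
{
    "settings" : {
        "number_of_shards" : 2,
        "number_of_replicas" : 1
    },
    "mappings" : {
        "properties" : {
            "tags" : { "type" : "text", "analyzer" : "simple" }
        }
    }
}

With that in place, you can search for the given prefix with a match phrase prefix query and get the desired result: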

GET index_name/_search
{
  "query": {
    "match_phrase_prefix": {
      "tags": {
        "query": "B:EGXXXXXX"
      }
    }
  }
}

This will produce the desired result.
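
For completeness, here is a minimal sketch of option 1. Every name in it (index_name_ngram, edge_tokenizer, edge_analyzer) is illustrative, and the gram sizes are assumptions: max_gram must be at least as long as the longest prefix users will type, and "punctuation" in token_chars keeps the colons and underscores inside the tokens.

PUT /index_name_ngram
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "edge_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": [ "letter", "digit", "punctuation" ]
        }
      },
      "analyzer": {
        "edge_analyzer": {
          "type": "custom",
          "tokenizer": "edge_tokenizer",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "tags": {
        "type": "text",
        "analyzer": "edge_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}

Indexing then stores prefixes such as b:, b:e, ... up to b:egxxxxxx, while the standard search_analyzer turns B:EGXXXXXX into that same single token at query time, so a plain match query finds the document:

GET index_name_ngram/_search
{
  "query": {
    "match": {
      "tags": "B:EGXXXXXX"
    }
  }
}

And a sketch of option 2: the completion field stores the raw value and serves pure prefix lookups (index_name_suggest and tag_suggest are arbitrary names):

PUT /index_name_suggest
{
  "mappings": {
    "properties": {
      "tags": { "type": "completion" }
    }
  }
}

GET index_name_suggest/_search
{
  "suggest": {
    "tag_suggest": {
      "prefix": "B:EGXXXXXX",
      "completion": { "field": "tags" }
    }
  }
}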

To conclude, it really depends on how fast you need the queries to be and how complex they will become in the future. Edge n-grams offer the best tradeoff between speed and complexity, but even match phrase prefix will work; it is just much slower.

HTH.
