Why do elasticsearch queries require a certain number of characters to return results?

It seems like there is a minimum number of characters needed to get results from Elasticsearch for a specific property I am searching. The property is called 'guid' and has the following mapping:

    "guid": {
        "type": "text",
        "fields": {
            "keyword": {
                "type": "keyword",
                "ignore_above": 256
            }
        }
    }

I have a document with the following GUID: 3e49996c-1dd8-4230-8f6f-abe4236a6fc4

The following query returns the document as expected:

{"match":{"query":"9996c-1dd8*","fields":["guid"]}}

However, this query does not:

{"match":{"query":"9996c-1dd*","fields":["guid"]}}

I have the same result with multi_match and query_string queries. I haven't been able to find anything in the documentation about a character minimum, so what is happening here?

CodePudding user response:

Elasticsearch does not require a minimum number of characters. What matters are the tokens generated by the analyzer.

An exercise that helps with understanding is to use the _analyze API to see the tokens in your index:

GET index_001/_analyze
{
  "field": "guid",
  "text": [
    "3e49996c-1dd8-4230-8f6f-abe4236a6fc4"
  ]
}

You pass in the term 3e49996c-1dd8-4230-8f6f-abe4236a6fc4, and these are the tokens it produces:

 "tokens" : [
    {
      "token" : "3e49996c",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "1dd8",
      "start_offset" : 9,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "4230",
      "start_offset" : 14,
      "end_offset" : 18,
      "type" : "<NUM>",
      "position" : 2
    },
    {
      "token" : "8f6f",
      "start_offset" : 19,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "abe4236a6fc4",
      "start_offset" : 24,
      "end_offset" : 36,
      "type" : "<ALPHANUM>",
      "position" : 4
    }
  ]
}

When you perform a search, the same analyzer that was used at indexing time is also applied to the search terms. Take the term "9996c-1dd8*":

GET index_001/_analyze
{
  "field": "guid",
  "text": [
    "9996c-1dd8*"
  ]
}

The generated tokens are:

{
  "tokens" : [
    {
      "token" : "9996c",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "1dd8",
      "start_offset" : 6,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}

Note that the inverted index contains the token 1dd8, and the term "9996c-1dd8*" also generates the token "1dd8", so the match takes place.
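
To see the effect on an actual search, here is a sketch that reuses the index_001 name from the _analyze calls above and the query_string query type mentioned in the question. The request matches the document because its second token, 1dd8, exists in the inverted index:

# Sketch: the trailing-wildcard term 1dd8* matches the indexed token 1dd8
GET index_001/_search
{
  "query": {
    "query_string": {
      "query": "9996c-1dd8*",
      "fields": ["guid"]
    }
  }
}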

When you test with the term "9996c-1dd*", no tokens match, so there are no results.

GET index_001/_analyze
{
  "field": "guid",
  "text": [
    "9996c-1dd*"
  ]
}

Tokens:

{
  "tokens" : [
    {
      "token" : "9996c",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "1dd",
      "start_offset" : 6,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}

Token "1dd" is not equal to "1dd8".
