I'm new to Elasticsearch and a little confused about how a certain field is stored in the Lucene index, since I get the error: Document contains at least one immense term in field="originalrow.sortable" .....bytes can be at most 32766 in length; got 893970"
The mapping in the index template:
"analyzer" : {
"rebuilt_hungarian" : {
"filter" : [
"lowercase",
"hungarian_stop",
"hungarian_keywords",
"hungarian_stemmer",
"asciifolding"
],
"tokenizer" : "standard"
},
"lowercase_for_sort" : {
"filter" : [
"lowercase"
],
"tokenizer" : "keyword"
}
}
..
..
"dynamic_templates" : [
{
"sortable_text" : {
"mapping" : {
"analyzer" : "rebuilt_hungarian",
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword"
},
"sortable" : {
"fielddata" : true,
"analyzer" : "lowercase_for_sort",
"type" : "text"
}
}
},
"match_mapping_type" : "string"
}
}
],
and the generated mapping for the field involved in the error:
"originalrow" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword"
},
"sortable" : {
"type" : "text",
"analyzer" : "lowercase_for_sort",
"fielddata" : true
}
},
"analyzer" : "rebuilt_hungarian"
}
So what I think (I might be wrong, of course) is that the originalrow.sortable field is indexed as text, but the whole text goes into the inverted index as a single term because of the keyword tokenizer, and that might be the cause of the error. Another thing is that the length of the text is only about 1800 characters, and I have no clue how the size can exceed 32K bytes.
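If it helps, I believe the single-token behaviour can be checked with the _analyze API (just a sketch; my_index stands in for the index created from the template):

POST /my_index/_analyze
{
  "analyzer" : "lowercase_for_sort",
  "text" : "Some long Hungarian text goes here ..."
}

As far as I can tell, the response contains exactly one token covering the whole input.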
Thank you in advance!!!
CodePudding user response:
For your sortable field you are using the lowercase_for_sort analyzer, which uses the keyword tokenizer. That tokenizer emits the whole input as a single token, and in Lucene the largest allowed size of a single token is 32766 bytes, as explained in this post.
And if you are using characters that take more than 1 byte each, you can cross this limit with even fewer characters. From the UTF docs:
A UTF maps each Unicode code point to a unique code unit sequence. A code unit is the minimal bit combination that can represent a character. Each UTF uses a different code unit size. For example, UTF-8 is based on 8-bit code units. Therefore, each character can be 8 bits (1 byte), 16 bits (2 bytes), 24 bits (3 bytes), or 32 bits (4 bytes). Likewise, UTF-16 is based on 16-bit code units. Therefore, each character can be 16 bits (2 bytes) or 32 bits (4 bytes).
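One possible way to keep the sortable sub-field while staying under the limit (a sketch, not tested against your template; the filter name truncate_for_sort and the length of 1000 are just example values) is to add a truncate token filter to the lowercase_for_sort analyzer, so the single keyword token is cut off well below 32766 bytes:

"analyzer" : {
  "lowercase_for_sort" : {
    "tokenizer" : "keyword",
    "filter" : [
      "lowercase",
      "truncate_for_sort"
    ]
  }
},
"filter" : {
  "truncate_for_sort" : {
    "type" : "truncate",
    "length" : 1000
  }
}

Sorting then only considers the first 1000 characters of each value. Alternatively, mapping sortable as a keyword sub-field with a lowercase normalizer and ignore_above would avoid fielddata entirely, at the cost of not indexing values longer than the threshold.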