Will using short in place of float help reduce the disk requirement for Elasticsearch?


I am working on reducing the disk requirement for data stored in Elasticsearch. The data mainly consists of fields containing lists of float values. I generate the JSON data from a streaming job and then index it into Elasticsearch. Will changing the data type from float to short help reduce the disk requirement? (Keeping in mind that the range of values is well within the short range and we can round off the float values.)

I found the docs at https://www.elastic.co/guide/en/elasticsearch/reference/current/number.html, but they seem to contain contradictory statements:

  1. storage is optimized based on the actual values that are stored, so picking one type over another one will have no impact on storage requirements.
  2. This is mostly helpful to save disk space since integers are way easier to compress than floating points.

Can someone help explain?

CodePudding user response:

Float values are 32-bit IEEE 754 floating-point numbers and short values are 16-bit integers, so it's pretty evident that shorts require less disk space.

The first statement you mention is only valid for integer types while the second is about floating point numbers stored as integers, so it's a bit like comparing apples and oranges.
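
For context, the second statement in the docs refers to storing floating-point data as integers, e.g. with the scaled_float type. A minimal sketch of such a mapping (the index and field names here are just examples):

PUT prices
{
  "mappings": {
    "properties": {
      "my_price": {
        "type": "scaled_float",
        "scaling_factor": 100
      }
    }
  }
}

With a scaling_factor of 100, a value like 17.99 is stored internally as the integer 1799, which is easier to compress.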

But for your concrete case, where the values fit within the short range, let's run a very naive yet empirical test: generate a substantial number of random short values (say 1M), store them in two different indexes, one whose field is mapped as short and another whose field is mapped as float, and then simply compare their sizes.

Here are the mappings I've used:

PUT shorts 
{
  "mappings": {
    "properties": {
      "my_short": {
        "type": "short"
      }
    }
  }
}
PUT floats 
{
  "mappings": {
    "properties": {
      "my_float": {
        "type": "float"
      }
    }
  }
}

And here is a sample of the random values stored in them:

{"index":{"_index":"shorts"}}
{"my_short":1799}
{"index":{"_index":"floats"}}
{"my_float":1799}
{"index":{"_index":"shorts"}}
{"my_short":31014}
{"index":{"_index":"floats"}}
{"my_float":31014}
{"index":{"_index":"shorts"}}
{"my_short":-880}
{"index":{"_index":"floats"}}
{"my_float":-880}
{"index":{"_index":"shorts"}}
{"my_short":31159}
{"index":{"_index":"floats"}}
{"my_float":31159}
...
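
The lines above are already in bulk format, so (assuming a local cluster or the Dev Tools console) they can be loaded via the _bulk endpoint, roughly like this:

POST _bulk
{"index":{"_index":"shorts"}}
{"my_short":1799}
{"index":{"_index":"floats"}}
{"my_float":1799}
...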

After loading all the data, we can check the respective sizes of the indices:

GET _cat/indices/shorts,floats?v

health status index  uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   floats puHwIu5wSSG23QEq4qxROA   1   1    1000000            0     57.7mb         28.8mb
green  open   shorts mDEHUB3FQoyuMNbsDy3zwA   1   1    1000000            0     53.3mb         26.6mb
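
Note that store.size can fluctuate with segment count and deleted documents, so for a more deterministic comparison you may want to force-merge both indices down to a single segment before re-running the _cat call:

POST shorts,floats/_forcemerge?max_num_segments=1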

So the answer is pretty clear: for the exact same data, float values take up more space!
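
And if you want to see exactly which structures (doc values, points, stored fields, etc.) account for the difference, recent versions (7.15+) also expose the analyze index disk usage API, still a technical preview at the time of writing:

POST shorts/_disk_usage?run_expensive_tasks=true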
