I am working on reducing the disk requirements for data stored in Elasticsearch. The data consists mainly of fields that contain lists of float values. I generate the JSON data in a streaming job and then index it into Elasticsearch. Will changing the data type from float to short help reduce the disk requirements? (The values are well within the short range, and we can round off the float values.)
I found this doc: https://www.elastic.co/guide/en/elasticsearch/reference/current/number.html but it seems to contain contradictory statements:
- storage is optimized based on the actual values that are stored, so picking one type over another one will have no impact on storage requirements.
- This is mostly helpful to save disk space since integers are way easier to compress than floating points.
Can someone help explain?
CodePudding user response:
Float values are 32-bit IEEE 754 floating-point numbers and short values are 16-bit integers, so it's pretty evident that shorts require less disk space.
The first statement you mention is only valid for integer types while the second is about floating point numbers stored as integers, so it's a bit like comparing apples and oranges.
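To make the second statement more concrete: it comes from the scaled_float part of that page, where a floating-point value is multiplied by a fixed scaling factor and stored as an integer. Here is a rough Python sketch of that idea only. The function names and the scaling factor of 100 are hypothetical choices for illustration, not Elasticsearch internals.

```python
def to_scaled(value: float, scaling_factor: float = 100.0) -> int:
    """Encode a float as an integer by scaling and rounding (illustrative only)."""
    return round(value * scaling_factor)


def from_scaled(stored: int, scaling_factor: float = 100.0) -> float:
    """Decode the stored integer back to an approximate float."""
    return stored / scaling_factor


if __name__ == "__main__":
    stored = to_scaled(12.5)  # the integer that would be kept on disk
    print(stored, from_scaled(stored))
```

The trade-off is precision: anything finer than 1 / scaling_factor is lost when the value is rounded.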
But for your concrete case, where the values are within the short range, let's run a very naive, yet empirical, test: generate a substantial number of random short values (say 1M), store them in two different indexes, one whose field is mapped as short and another whose field is mapped as float, and then simply compare their sizes.
Here are the mappings I've used:
PUT shorts
{
"mappings": {
"properties": {
"my_short": {
"type": "short"
}
}
}
}
PUT floats
{
"mappings": {
"properties": {
"my_float": {
"type": "float"
}
}
}
}
And here are the random values stored in them, in bulk format:
{"index":{"_index":"shorts"}}
{"my_short":1799}
{"index":{"_index":"floats"}}
{"my_float":1799}
{"index":{"_index":"shorts"}}
{"my_short":31014}
{"index":{"_index":"floats"}}
{"my_float":31014}
{"index":{"_index":"shorts"}}
{"my_short":-880}
{"index":{"_index":"floats"}}
{"my_float":-880}
{"index":{"_index":"shorts"}}
{"my_short":31159}
{"index":{"_index":"floats"}}
{"my_float":31159}
...
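If you want to reproduce the test, a bulk file like the one above could be generated with something along these lines. This is a minimal sketch, not the script used for the numbers in this answer: the function names, the seed and the output file name are all assumptions, and a 1M-document file would most likely have to be sent to the _bulk endpoint in smaller batches.

```python
import json
import random


def bulk_lines(count, seed=42):
    """Yield NDJSON lines: for each random value, an action line plus a
    document line for the shorts index, then the same for the floats index."""
    rng = random.Random(seed)
    for _ in range(count):
        value = rng.randint(-32768, 32767)  # range of a signed 16-bit short
        yield json.dumps({"index": {"_index": "shorts"}}, separators=(",", ":"))
        yield json.dumps({"my_short": value}, separators=(",", ":"))
        yield json.dumps({"index": {"_index": "floats"}}, separators=(",", ":"))
        yield json.dumps({"my_float": value}, separators=(",", ":"))


def write_bulk_file(path, count):
    """Write the bulk body to a file; every line, including the last, ends with a newline."""
    with open(path, "w") as f:
        for line in bulk_lines(count):
            f.write(line + "\n")


if __name__ == "__main__":
    write_bulk_file("bulk.ndjson", 1_000_000)  # hypothetical file name
```

Both indices receive exactly the same values, so any difference in size comes from the field type alone.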
After loading all the data, we can check the respective sizes of the indices:
GET _cat/indices/shorts,floats?v
health status index  uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   floats puHwIu5wSSG23QEq4qxROA   1   1    1000000            0     57.7mb         28.8mb
green  open   shorts mDEHUB3FQoyuMNbsDy3zwA   1   1    1000000            0     53.3mb         26.6mb
So the answer is pretty clear: for the exact same data, float values take up more space!
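To put a number on the gap, the pri.store.size values from the output above can be compared directly. A small sketch; the helper names are hypothetical and the parser only handles the "mb" suffix seen in this output:

```python
def parse_mb(size: str) -> float:
    """Parse a _cat size string such as '28.8mb' into megabytes."""
    if not size.endswith("mb"):
        raise ValueError(f"unsupported size unit: {size!r}")
    return float(size[:-2])


def percent_larger(bigger: str, smaller: str) -> float:
    """Return how much larger `bigger` is than `smaller`, in percent."""
    small = parse_mb(smaller)
    return (parse_mb(bigger) - small) / small * 100


if __name__ == "__main__":
    # Primary store sizes of the floats and shorts indices from the table above
    print(f"{percent_larger('28.8mb', '26.6mb'):.1f}%")
```

For this test the float index comes out roughly 8% larger than the short index, so the saving is real but modest.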