Is elasticsearch safe in single node in production environment of sensitive information?-CodePudding

Elastic on single node it can be faster than cluster, but what are the advantages and disadvantages of using it in a production environment with only one node.

I have a problem because my DBA wants to use elastic in single node with the justification of speed, but he is not taking into account availability, redundancy and failures against disaster the system is not slow, but all documentation I read about elastic says that in production environment it needs to run in cluster with his nodes/shard, We are an information bureau, we provide data for banks, credit analysis, and numerous large customer applications. Help me with arguments that prove that I am sure that the information we are dealing with needs high availability and redundancy. The index size is about 2.2TB I wanted to run on cluster as the information is very sensitive, but my DBA wants to run on single node, on production environment Help me give him an answer, if he's right or I'm wrong.

CodePudding user response：

You should run a benchmark of the potential workloads if he needs proof.

It isn't really viable to run any real production load on a single system because the shards are what really create the speed of elasticsearch: Queries are run on many shards and partial results are returned to form the full result. If you use a single node to scan the entire dataset it will take a very long time for that single node to process everything.

I guess the issue really comes down to how much your data is getting updated and how many queries are running in parallel.

Without any network traffic it could seem faster on a small workload, but if your search cluster needs to do continuous indexing and run parallel queries it will just get stuck and stop returning results.

CodePudding user response：

If a single-node cluster will be faster or slower than a multi-node cluster depends on the use case and many other factors, the argument that a single-node is faster is not valid without a comparative benchmark of the real use case.

For example, If your use case has more indexing than searchs, a single-node cluster can be faster, if your use case has more searchs than indexing, then a single-node cluster can be slower.

But even if it is faster in a specific use case, running a single-node cluster in production is not recommended and it is very risky.

The main issue is that a single-node has no resilience to failure, if your node is down, your entire cluster is down and until your node is back up running, the data in your elasticsearch cluster is unavailable.

Depending on how you will ingest the data, this can also lead to data loss.

If for some reason the data in your node gets corrupted or lost, you will need to restore it from a previous snapshot. On a multi-node cluster if a node is lost, you can spin-up another one and the cluster will take care of replicating the data, assuming you use replicas for your index.

There are also some limitations on how many shards you can have on a node and how many memory you should use for the java heap memory, the recomendations by elastic in those cases is to try to keep the number of shards per GB of heap memory below 20 and do not use more than 30 GB for the heap memory, so for a single-node this will give you a maximum of 600 shards.

If this is enough, depends on the use case and your indexing/shard strategy.

You should ask yourself if you can afford downtime and lose data, if the answer is no to both, then you should not use a single-node cluster.