ES rebuild index (reindex) performance optimization suggestions


Official Reindex documentation:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html

Introduction to Reindex
Reindex was added in version 5.x. It can rebuild data directly within an Elasticsearch cluster: if a mapping change or a settings change requires an index to be rebuilt, Reindex makes it easy to rebuild asynchronously. It also supports transferring data across clusters, so that, for example, indices created per day can be periodically merged into per-month indices. Note that the source index must have _source enabled.
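For the cross-cluster case, a minimal reindex-from-remote sketch looks like the following. The host URL and index names here are placeholders, and the remote host must also be whitelisted via the reindex.remote.whitelist setting in elasticsearch.yml:

POST _reindex
{
  "source": {
    "remote": {
      "host": "http://otherhost:9200"
    },
    "index": "source"
  },
  "dest": {
    "index": "dest"
  }
}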

Analysis of why Reindex is slow
At its core, Reindex migrates data across indices or across clusters.
The causes of slowness, and the corresponding optimization ideas, include:

1) The batch size may be too small.
It needs to be tuned together with the heap memory and thread pool sizes.
2) Reindex is implemented on top of scroll; sliced (parallel) scroll can be used to improve efficiency.
3) Whether across indices or across clusters, the core of the work is writing data, so consider optimizations from the angle of improving write efficiency.
Practices for improving Reindex efficiency
Increase the batch size of writes
By default, _reindex uses scroll batches of 1000 documents; you can adjust this with the size field in source.

POST _reindex
{
  "source": {
    "index": "source",
    "size": 5000
  },
  "dest": {
    "index": "dest",
    "routing": "=cat"
  }
}
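Because a large reindex can run for a long time, the asynchronous rebuild mentioned above is typically done by not waiting for completion and then polling the Task API. A minimal sketch (the task id in the second request is a placeholder copied from the first response):

POST _reindex?wait_for_completion=false
{
  "source": {
    "index": "source",
    "size": 5000
  },
  "dest": {
    "index": "dest"
  }
}

GET _tasks/oTUltX4IQMOUUVeiohTt8A:12345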
How to choose the batch size:

(1) Use bulk requests sized for the best indexing performance.
The optimal batch size depends on the data, the analyzers, and the cluster configuration, but a good starting point is 5-15 MB per batch.
Note that this is the physical size; document count is not a good measure of batch size. For example, if each batch indexes 1000 documents:
1) 1000 documents of 1 KB each is 1 MB;
2) 1000 documents of 100 KB each is 100 MB.
These are completely different batch sizes.
(2) Tune by gradually increasing the batch size.
1) Start with a bulk size of roughly 5-15 MB and increase it slowly until you see no further performance improvement; then start increasing the write concurrency (multiple threads, and so on).
2) Use Kibana, cerebro, or tools such as iostat, top, and ps to monitor the nodes and watch for the moment a resource bottleneck appears. If you start receiving EsRejectedExecutionException, the cluster can no longer keep up: at least one resource has reached capacity. Either reduce the concurrency, provide more of the limited resource (e.g., switch from mechanical hard disks to SSDs), or add more nodes. A quick way to check for rejections is sketched after this list.
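As a monitoring aid, the cat thread pool API can show how many write requests a node has rejected. A minimal sketch (on versions before 6.x the pool is named bulk rather than write):

GET _cat/thread_pool/write?v&h=node_name,name,active,queue,rejected

A steadily growing rejected count means write capacity is saturated and the concurrency or batch size should be lowered.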
Use sliced scroll to improve write efficiency
Reindex supports Sliced scroll, which parallelizes the rebuilding process. This parallelization can improve efficiency and provides a convenient way to break the request down into smaller parts.

How slicing works (from medcl)
1) Ever found the Scroll interface slow? If you use Scroll to traverse a large amount of data, the speed can be truly unacceptable. Now Scroll requests can traverse the data concurrently.
2) Each Scroll request can be split into multiple Slice requests, which can be understood as slices. Each Slice runs independently and in parallel, so rebuilding or traversing with Scroll becomes many times faster.

Slicing usage example
Slicing can be set in two ways: manual slicing and automatic slicing.
For manual slicing, see the official documentation (a sketch is also given after the notes below).
Automatic slicing is as follows:

POST _reindex?slices=5&refresh
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter"
  }
}
Notes on setting the slices size:
1) The slices size can be specified manually, or slices can be set to auto. auto means: for a single index, slices = the number of shards; for multiple indices, slices = the smallest number of shards among them.
2) Query performance is most efficient when the number of slices equals the number of shards in the index. Setting slices larger than the number of shards does not improve efficiency; it only adds overhead.
3) If the slices number is large (e.g., 500), choose a lower number, because too many slices hurt performance.
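For reference, a minimal manual-slicing sketch: two requests that each process one slice of the same source and can be run in parallel (index names follow the automatic example above):

POST _reindex
{
  "source": {
    "index": "twitter",
    "slice": {
      "id": 0,
      "max": 2
    }
  },
  "dest": {
    "index": "new_twitter"
  }
}

POST _reindex
{
  "source": {
    "index": "twitter",
    "slice": {
      "id": 1,
      "max": 2
    }
  },
  "dest": {
    "index": "new_twitter"
  }
}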

Set the number of ES replicas to 0
If you want to do a large bulk import, consider disabling replicas by setting index.number_of_replicas to 0.
The main reason: when a document is replicated, the entire document is sent to the replica node and the indexing process is repeated verbatim. This means each replica performs the analysis, indexing, and potential merging work.
In contrast, if you index with zero replicas and add the replicas back once ingestion has finished, the recovery process is essentially a byte-for-byte network transfer, which is much more efficient than duplicating the indexing process.

PUT /my_logs/_settings
{
  "number_of_replicas": 0
}
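After the import finishes, the replicas can be added back (the value 1 below is only an example; restore whatever your index normally uses):

PUT /my_logs/_settings
{
  "number_of_replicas": 1
}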
Increase the refresh interval or disable refresh
If you do not need near-real-time accuracy in search results, consider turning off index refresh. The default refresh_interval is 1s; while doing a reindex, each index's refresh_interval can be set to 30s, or refresh can be disabled entirely (-1).
If you are importing a large amount of data, and reindex is exactly this scenario, first set the value to -1 to disable refresh, then reset it back to a normal value after completion!
Method:

PUT /index_name/_settings
{ "refresh_interval": "-1" }

Method to restore it:

PUT /index_name/_settings
{ "refresh_interval": "30s" }