Elasticsearch disk usage issue


I have a strange issue in my Elasticsearch cluster. There are 5 nodes (4 combined data/master nodes and 1 master-only node), and each node has 5.7 TB of disk space. On the first node the disk is almost completely full, while on the rest it is roughly half full, even though the number of shards on all nodes is approximately the same.

df -h from the first node:

/dev/mapper/vg1-data  5.8T  5.1T  717G  88% /var/lib/elasticsearch

and here is the _cat/allocation output:

shards disk.indices disk.used disk.avail disk.total disk.percent host        ip          node
   354          5tb       5tb    714.9gb      5.7tb           87 10.0.5.21 10.0.5.21 elastic-01
   392        3.2tb     3.2tb      2.5tb      5.7tb           55 10.0.5.23 10.0.5.23 elastic-02
   393        3.8tb     3.8tb      1.8tb      5.7tb           67 10.0.5.28 10.0.5.28 elastic-07
   392        3.9tb     3.9tb      1.7tb      5.7tb           69 10.0.5.27 10.0.5.27 elastic-06

I tried summing the results from _cat/shards | grep elastic-01, and it turned out that all shards on this node occupy only about 3.5 TB:

curl -X GET   http://10.0.5.22:9200/_cat/shards | grep elastic-01 | awk '{ print $6 }'
  
94.3kb
279b
1.4gb
333.9mb
13.2gb
260.4mb
11.5gb
20.3gb
28.5gb
10gb
12.8gb
365.7mb
9.1gb
263.3mb
92.5gb
951.1kb
266.4mb
35.9gb
10.8gb
299.6mb
22gb
526kb
31.2mb
110.1mb
1mb
46.9gb
19.3gb
358.1kb
17.9gb
22.4kb
11.7gb
3.9gb
5.1gb
427.2mb
1.1mb
48.4gb
elastic-01
75.3mb
6.7gb
30.6gb
43.8gb
31.1mb
21.3gb
10.7gb
1.1gb
17gb
5.1gb
38.4gb
49.1gb
20.2mb
7gb
7.3mb
7.3mb
383.1mb
322.7mb
130.9gb
18.5gb
34.1gb
291.8mb
537.3mb
1.6gb
15.6gb
96.4mb
7.4mb
5.8gb
114.3gb
4.3gb
25gb
7.4gb
7.4gb
638.1kb
10.5gb
175.6kb
275.9mb
33.2mb
806.8kb
35.5gb
40.1gb
17.1gb
408.6mb
115.2mb
69mb
20.3gb
542.4kb
28.4gb
385.6mb
12.9gb
1.3mb
5.5mb
66.6mb
17.5gb
18.7gb
35.6gb
10.9gb
986.3kb
10.3gb
19.1gb
412.8mb
34.4gb
22.6gb
5.1gb
883.4kb
5.3gb
10.4gb
276.4mb
31.9gb
34.5gb
58.1gb
22.3gb
18.8gb
93.9kb
176.5gb
249.3mb
38.1kb
12.1gb
19.7gb
7.6gb
24.7gb
779.9kb
11.2gb
4.9mb
19.1gb
1.2gb
21.1gb
30.4gb
3.8gb
276.5kb
26.3gb
379.9mb
10.4gb
5.5gb
31gb
802.4kb
868.3kb
43.9gb
5.8gb
463.5mb
18.7gb
3.3gb
12gb
4.3gb
32.1gb
3.3gb
11.3gb
1.2mb
944kb
118.2mb
25.8gb
23.9gb
799kb
410.4mb
6mb
5.1gb
32gb
30gb
7.8gb
32.3gb
24.9gb
25.1gb
18gb
16.4gb
1.2gb
915.2kb
4.9mb
29.2gb
59.5kb
1.3gb
150.8gb
1.6gb
11.2gb
17.4gb
439.4mb
6.3mb
21.6gb
394.9mb
26.9gb
23.5gb
43.8gb
28gb
8.9gb
19.5gb
30.3gb
31.8gb
14.7gb
19gb
34.9gb
41.3kb
63.4gb
41.8gb
22.7gb
15gb
32.6gb
281.4mb
379.5mb
8.6mb
3.6mb
37.7gb
10.9gb
818.7kb
19gb
115kb
112.3kb
10gb
7.4mb
685.2kb
332.9mb
5gb
20.2gb
39.5gb
8.6mb
289.5mb
19.3mb
289.6mb
1.1gb
1.6gb
24.8gb
18.1mb
915kb
22.4gb
5.8mb
429mb
261b
20.3gb
930.8kb
19.2gb
25.6gb
31gb
26.6gb
20.1gb
20.2gb
538.4kb
27.4gb
1.2mb
290.6mb
403.6mb
77.4mb
41.7gb
2.7gb
3gb
17.7gb
11.3gb
15.9gb
282.4mb
10.7gb
962.9kb
888.6kb
16.9gb
176.9gb
11.6gb
21.4gb
5.1mb
26.1gb
331.1mb
3.9gb
9.6gb
29.6gb
7.8gb
17.8gb
19.2gb
7.5gb
388.8mb
43.4gb
31.5gb
3gb
21.6mb
15.2gb
11.2gb
54.1gb
17.4gb
1.5gb
34.8gb
273.1mb
32.3gb
17.7gb
2.2gb
17.5gb
22.6gb
820.7kb
1gb
6.6gb
7.8mb
9.3gb
34.5gb
24.1gb
32.9gb
25.2gb
2.9gb
2.6gb
4.6mb
42.8gb
9.3gb
17.9kb
23.4gb
1.1gb
20.6gb
18.1gb
27gb
25.7gb
5mb
32.5gb
29.1gb
42kb
22.5gb
3.1mb
22.6gb
9.8gb
11gb
28.5gb
14.2gb
89.2kb
34.5gb
41.8gb
25gb
410.2mb
20.6gb
16.5gb
16.2gb
19.8gb
7.3gb
13.4gb
11.4gb
10.4gb
11.8gb
7.3mb
1.1gb
46.9gb
10.4gb
535.6mb
55.5gb
19.2gb
14.1gb
20.3gb
28.9gb
30.5gb
4.7gb
49.4gb
7.7gb
9.7gb
6.6gb
20.7gb
29.2gb
18.9gb
9.3gb
19gb
757.4kb
902.4kb

But why do both Elasticsearch and du -hs show that more space is being used than that?

du -hs inside /var/lib/elasticsearch also shows 5.1 TB:

du -hs *
5.1T nodes
4.0K range

CodePudding user response:

You should change your URL to

http://10.0.5.22:9200/_cat/shards?bytes=b

so that you get whole numbers (bytes) instead of human-readable values. You probably ran into trouble adding up the mixed kb/mb/gb figures: when all figures are plain byte counts, summing them gives 5,265,110,799,379, which matches the results you get from _cat/allocation and du -hs pretty well.
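
For example, a one-liner along these lines should reproduce that sum (a sketch, assuming the same column layout as in your original command, i.e. the store size in the sixth column):

curl -s 'http://10.0.5.22:9200/_cat/shards?bytes=b' | grep elastic-01 | awk '{ sum += $6 } END { print sum }'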

As to why elastic-01 uses more disk space than the other nodes: it seems to hold some very big shards. From the list you shared, these are the largest shard sizes (sorted descending):

176 900 000 000
176 500 000 000
150 800 000 000
130 900 000 000
114 300 000 000
92 500 000 000
63 400 000 000
58 100 000 000
55 500 000 000
54 100 000 000
49 400 000 000
49 100 000 000

Five of those shards are well above 100 GB each, which is usually not a good sign: your shards have grown too big. Remember that shards are the unit of partitioning of your indexes.

I'm pretty sure there are no shards that big on your other nodes.
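
You can confirm which indexes those large shards belong to by sorting _cat/shards by store size; something like this should work (the s and h parameters are standard _cat options):

curl -s 'http://10.0.5.22:9200/_cat/shards?v&s=store:desc&h=index,shard,prirep,store,node' | head -20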

There are a couple of ways forward. First, I would check whether the indexes containing those shards hold time-based data without Index Lifecycle Management (ILM) managing them; ILM rollover would keep such indexes (and their shards) at a more manageable size.
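
For instance, you can ask Elasticsearch whether ILM is managing a given index (a sketch; replace <index_name> with one of the indexes holding the big shards):

curl -s 'http://10.0.5.22:9200/<index_name>/_ilm/explain?pretty'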

Second, I would check whether those shards contain a lot of deleted documents by looking at the docs.deleted column in the output of the following command:

GET _cat/indices?v
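
If the cluster has many indexes, sorting by that column narrows things down quickly (a sketch using the same node address as above):

curl -s 'http://10.0.5.22:9200/_cat/indices?v&h=index,docs.count,docs.deleted,store.size&s=docs.deleted:desc'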

If that's the case (i.e. the index holds documents that are frequently updated or deleted), it might be possible to regain some space by running

POST <index_name>/_forcemerge?only_expunge_deletes=true

The previous command should be run with great care: a force merge needs extra disk space while it runs, and you don't have much left, so it might not be feasible in your case.
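
A force merge also returns no progress information; if you do run it, one way to check whether it is still going is the tasks API (again a sketch, assuming the same node address):

curl -s 'http://10.0.5.22:9200/_tasks?actions=*forcemerge*&detailed=true&pretty'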

There are other options, but I would investigate these two points first.
