I am a newbie DevOps Engineer and I need some help, please go easy on me :)
I was checking out our DEV AKS cluster at work and noticed that Fluentd is using a crazy amount of memory and isn't releasing it back, example below:
fluentd-dev-95qmh 13m 1719Mi
fluentd-dev-fhd4w 9m 1732Mi
fluentd-dev-n22hf 11m 660Mi
fluentd-dev-qlzd8 12m 524Mi
fluentd-dev-rg9gp 9m 2338Mi
Fluentd is deployed as a daemonset so I can't just scale it up or down, unfortunately.
The version we are running is 1.2.22075.8 and it gets deployed via CI/CD pipelines using a deployment.yml file and a dockerfile.
Here is the dockerfile:
FROM quay.io/fluentd_elasticsearch/fluentd:v3.2.0
#RUN adduser --uid 10000 --gecos '' --disabled-password fluent --no-create-home && \
#chown fluent:fluent /entrypoint.sh && \
#chown -R fluent:fluent /etc/fluent/ && \
#chown -R fluent:fluent /usr/local/bin/ruby && \
#chown -R fluent:fluent /usr/local/bundle/bin/fluent* && \
#chown -R fluent:fluent /var/lib/docker/containers && \
#chown -R fluent:fluent /var/log
#USER fluent
I went to https://quay.io/repository/fluentd_elasticsearch/fluentd?tab=tags&tag=latest and saw that there were newer versions available. I wanted to update Fluentd to v3.3.0 and thought I could do this by just changing the version number in the dockerfile and triggering a build. When I did this the release pipeline failed: two pods were in "CrashLoopBackOff" state and three pods were running normally. I also had a bunch of errors related to Ruby. I know I should have taken note of the errors, but since this was at work I just got scared, reverted the version in the dockerfile from v3.3.0 back to v3.2.0, triggered a build, and everything went back to how it was before.
Could someone please help me out? How do I update the version of the Fluentd daemonset? Is there a way I can restart these pods and clear the memory? I've Googled this question and it doesn't seem like there is a way to do this easily because it is not a regular deployment.
Also, any idea why fluentd would be eating so much memory?
Any help would be appreciated. I need to resolve this ASAP because this issue is having a negative impact on the DEV cluster: 3 out of 5 nodes are above 110% memory usage.
Thank you
CodePudding user response:
Regarding the question about restarting the fluentd pods:
If you have permissions to delete pods in the namespace where fluentd is deployed, you can simply delete the pods to restart fluentd
kubectl delete pod fluentd-xxxxx
Since the daemonset definition is still there, the Kubernetes control plane will notice that fluentd pods are missing and start new ones. This does mean there will be a short time window with no fluentd running on the node (i.e. the time between when the delete command is issued and when the new fluentd pod is operational). The control plane detects the missing pods almost instantly, but the new pods take some time to start.
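For example (a rough sketch; the daemonset name fluentd-dev and the kube-system namespace are assumptions here, check yours with kubectl get daemonsets --all-namespaces):
# delete a single pod, using one of the pod names from your kubectl top output
kubectl -n kube-system delete pod fluentd-dev-rg9gp
# or, with kubectl 1.15+, restart every pod of the daemonset in a rolling fashion
kubectl -n kube-system rollout restart daemonset fluentd-dev
kubectl -n kube-system rollout status daemonset fluentd-dev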
CodePudding user response:
As far as I understand, you want to upgrade fluentd to v3.3.0 and find a solution for the high memory usage.
As far as the update goes, it is correct to rebuild the container image with the new version. After building and pushing the new image, you have to point the daemonset manifest (your deployment.yml) at the newly built image and deploy it again. The image tag is set in the manifest under spec.template.spec.containers[0].image.
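Something like this (a minimal sketch, not your actual manifest; the names, namespace, labels and registry are assumptions, keep whatever your deployment.yml already has):
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd-dev
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: fluentd-dev
  template:
    metadata:
      labels:
        app: fluentd-dev
    spec:
      containers:
      - name: fluentd
        # this is the spec.template.spec.containers[0].image field; bump the tag here
        image: yourregistry.azurecr.io/fluentd:v3.3.0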
Regarding the high memory usage: fluentd always uses some memory (in our clusters around ~700-1000MB per pod) and I consider this a normal amount. There are articles suggesting that fluentd can run with significantly lower memory usage, and while this may be true, I've never seen it in a production scenario where the log pipeline is actually processing, sorting and pushing a considerable amount of log data using plugins, filters and such. Remember to set the requests/limits in your k8s manifest accordingly, and don't be too restrictive, because you don't want your logging pipeline to fail every time there are network hiccups and such.
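For example (the numbers are guesses based on the usage you posted, not recommendations; this stanza goes on the fluentd container entry in the manifest sketch above):
resources:
  requests:
    cpu: 100m
    memory: 512Mi
  limits:
    memory: 1536Mi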
If the memory usage is drastically higher than expected (~1.5-2GB or more), it is very likely that your logging backend (elasticsearch) is not able to keep up with the amount of requests and is throttling them or even refusing to accept new ones, so fluentd has to keep the backlog in its buffers. To debug this, increase the log level of fluentd and check the logs of your logging backend. If that is the case, you can do one of the following:
- Decrease the number of requests per time unit and increase the size of the chunks that fluentd pushes to elasticsearch. Fewer but bigger requests are much easier for elasticsearch to process than many small ones. This is done in the fluentd configuration (buffer settings on the elasticsearch output, see the sketch after this list).
- Scale your logging backend. If you cannot or don't want to decrease the number of requests from fluentd, or the problem can't be handled with fluentd configs alone, scale your elasticsearch cluster by adding extra data nodes. If you don't already have them, add dedicated coordinating ("API") nodes first to take load off the data nodes before scaling them.
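For the first option, the relevant knobs live in the <buffer> section of the elasticsearch <match> block in the fluentd config. A rough sketch (the match pattern, host and numbers are assumptions, tune them against your own log volume):
<match **>
  @type elasticsearch
  host elasticsearch-logging.svc
  port 9200
  <buffer>
    @type file
    path /var/log/fluentd-buffers/kubernetes.buffer
    # bigger chunks and less frequent flushes mean fewer, larger bulk requests to elasticsearch
    chunk_limit_size 8M
    flush_interval 30s
    flush_thread_count 2
    # cap how much backlog fluentd will hold before applying backpressure
    total_limit_size 512M
    overflow_action block
    retry_max_interval 30
  </buffer>
</match>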
A less likely reason is that you are using inline ruby expressions instead of plugins in your fluentd pipeline. If that is the case it can cause performance issues, but try the measures mentioned above first, because rewriting fluentd pipelines can be tricky if you don't know how to write/edit them.
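To be clear about what I mean by inline ruby, this is the kind of thing to look for in your fluentd config (an illustrative example, not something from your setup), typically a record_transformer filter with enable_ruby turned on:
<filter kubernetes.**>
  @type record_transformer
  enable_ruby true
  <record>
    # the ruby inside ${...} is evaluated for every single record; convenient, but it adds overhead at high log volume
    message_size ${record["log"].to_s.length}
  </record>
</filter>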