After upgrading Kubernetes from 1.18.13 to 1.19.5 I get the error below for some pods, seemingly at random. After some time the pod fails to start (it's a simple standalone pod that doesn't belong to a deployment):
Warning FailedMount 99s kubelet Unable to attach or mount volumes: unmounted volumes=[red-tmp data logs docker src red-conf], unattached volumes=[red-tmp data logs docker src red-conf]: timed out waiting for the condition
- On 1.18 we didn't have this issue; during the upgrade K8S also didn't show any errors or incompatibility messages.
- No additional logs from any other K8S components (I tried increasing the verbosity level for kubelet).
- Disk space and other host metrics such as load average and RAM are fine.
- No network storage, only local data.
- The PVs and PVCs are created before the pods and we don't change them.
- Tried higher K8S versions, but no luck.
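For reference, this is roughly how I have been looking for the underlying mount failure (a sketch; the grep pattern is just my guess at the relevant kubelet lines, and it assumes a systemd-managed kubelet):

```shell
# Events for the failing pod (names taken from the pod definition below):
#   kubectl describe pod provision -n red
# Kubelet logs on the node, filtered for volume/mount messages:
#   journalctl -u kubelet --since "30 min ago" | grep -iE 'volume|mount'

# The same filter applied to a sample of the event text above:
sample='Unable to attach or mount volumes: unmounted volumes=[red-tmp data logs docker src red-conf]: timed out waiting for the condition'
echo "$sample" | grep -iE 'volume|mount'
```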
We have pretty standard setup without any special customizations:
- CNI: Flannel
- CRI: Docker
- Only one node, acting as both master and worker
- 16 cores and 32 GB RAM
Example of pod definition:
apiVersion: v1
kind: Pod
metadata:
  labels:
    app: provision
    ver: latest
  name: provision
  namespace: red
spec:
  containers:
  - args:
    - wait
    command:
    - provision.sh
    image: app-tests
    imagePullPolicy: IfNotPresent
    name: provision
    volumeMounts:
    - mountPath: /opt/app/be
      name: src
    - mountPath: /opt/app/be/conf
      name: red-conf
    - mountPath: /opt/app/be/tmp
      name: red-tmp
    - mountPath: /var/lib/app
      name: data
    - mountPath: /var/log/app
      name: logs
    - mountPath: /var/run/docker.sock
      name: docker
  dnsConfig:
    options:
    - name: ndots
      value: "2"
  dnsPolicy: ClusterFirst
  enableServiceLinks: false
  restartPolicy: Never
  volumes:
  - hostPath:
      path: /opt/agent/projects/app-backend
      type: Directory
    name: src
  - name: red-conf
    persistentVolumeClaim:
      claimName: conf
  - name: red-tmp
    persistentVolumeClaim:
      claimName: tmp
  - name: data
    persistentVolumeClaim:
      claimName: data
  - name: logs
    persistentVolumeClaim:
      claimName: logs
  - hostPath:
      path: /var/run/docker.sock
      type: Socket
    name: docker
PV
apiVersion: v1
kind: PersistentVolume
metadata:
  name: red-conf
  labels:
    namespace: red
spec:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 2Gi
  hostPath:
    path: /var/lib/docker/k8s/red-conf
  persistentVolumeReclaimPolicy: Retain
  storageClassName: red-conf
  volumeMode: Filesystem
PVC
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: conf
  namespace: red
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 2Gi
  storageClassName: red-conf
  volumeMode: Filesystem
  volumeName: red-conf
The tmp, data, and logs PVs have the same setup as conf, except for the path. Each has its own folder:
/var/lib/docker/k8s/red-tmp
/var/lib/docker/k8s/red-data
/var/lib/docker/k8s/red-logs
Currently I don't have any clues for how to diagnose the issue :(
I'd be glad to get any advice. Thanks in advance.
CodePudding user response:
You must be using local volumes. Follow the link below to understand how to create a StorageClass, PVs, and PVCs when using local volumes:
https://kubernetes.io/blog/2019/04/04/kubernetes-1.14-local-persistent-volumes-ga/
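Following that guide, a local-volume version of the conf PV might look roughly like this (a sketch only: local volumes require a node-affinity block, they support only ReadWriteOnce access, and `your-node-name` is a placeholder for the actual node name):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner   # local volumes have no dynamic provisioner
volumeBindingMode: WaitForFirstConsumer     # delay binding until a pod is scheduled
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: red-conf
spec:
  capacity:
    storage: 2Gi
  accessModes:
  - ReadWriteOnce                  # local volumes do not support ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /var/lib/docker/k8s/red-conf
  nodeAffinity:                    # required for local volumes
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - your-node-name         # replace with the actual node name
```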
CodePudding user response:
I recommend you start troubleshooting by reviewing the VolumeAttachment events against the node the PV is tied to; perhaps your volume is still linked to a node that was evicted and replaced by a new one.
You can use this command to check your PV name and status:
kubectl get pv
Then, to review which node has the corresponding VolumeAttachment, you can use the following command:
kubectl get volumeattachment
Once you have the name of your PV and the node it is attached to, you will be able to see whether the PV is tied to the correct node, or to a previous node that is not working or was removed. When a node gets evicted, its pods are rescheduled onto a new available node from the pool; to see which nodes are ready and running, you can use this command:
kubectl get nodes
If you find that your PV is tied to a node that no longer exists, you will need to delete the VolumeAttachment with the following command:
kubectl delete volumeattachment [csi-volumeattachment_name]
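To see all three pieces of information at once, the standard VolumeAttachment fields (`spec.nodeName`, `spec.source.persistentVolumeName`) can be projected with custom columns (a sketch; the sample output and node names below are hypothetical):

```shell
# Show each VolumeAttachment with its PV and the node it is attached to:
#   kubectl get volumeattachment -o custom-columns=\
#     NAME:.metadata.name,PV:.spec.source.persistentVolumeName,NODE:.spec.nodeName
#
# Filtering hypothetical output for a node that no longer exists:
sample='csi-abc123   red-conf   old-node-1
csi-def456   red-tmp    worker-1'
echo "$sample" | grep 'old-node-1'
```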
If you need a detailed guide for this troubleshooting, you can follow this link.