EKS Fargate pod for Airflow keeps restarting with error code

Time:10-18

I am trying to deploy Airflow on EKS Fargate using Helm. I have the EKS cluster, SC, PV, and PVC set up, along with the namespace and the Fargate profile (dev).

The problem appears when I run helm install:

helm upgrade --install airflow apache-airflow/airflow -n dev --values values.yaml --set volumePermissions.enbled=true --debug

[![list of pods][1]][1]

Above is the list of pods. The last 3 keep going into CrashLoopBackOff.

Here is the kubectl describe output for the webserver pod:

C:\Users\tanma>kubectl describe pods -n dev airflow-webserver-775d548b98-wd5x8
Name:                 airflow-webserver-775d548b98-wd5x8
Namespace:            dev
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      airflow-webserver
Node:                 fargate-ip-192-168-161-147.us-west-2.compute.internal/192.168.161.147
Start Time:           Thu, 13 Oct 2022 17:12:54 -0400
Labels:               component=webserver
                      eks.amazonaws.com/fargate-profile=dev
                      pod-template-hash=775d548b98
                      release=airflow
                      tier=airflow
Annotations:          CapacityProvisioned: 0.25vCPU 0.5GB
                      Logging: LoggingDisabled: LOGGING_CONFIGMAP_NOT_FOUND
                      checksum/airflow-config: 978d20ff42d3de620bee24f2e35b1769f20ebd948890bf474bd940624e39f150
                      checksum/extra-configmaps: 2e44e493035e2f6a255d08f8104087ff10d30aef6f63176f1b18f75f73295598
                      checksum/extra-secrets: bb91ef06ddc31c0c5a29973832163d8b0b597812a793ef911d33b622bc9d1655
                      checksum/metadata-secret: d9bd679df96f2631a8559d02cc528fd78c3d73c06289be9816d83fb332e05b5e
                      checksum/pgbouncer-config-secret: da52bd1edfe820f0ddfacdebb20a4cc6407d296ee45bcb500a6407e2261a5ba2
                      checksum/webserver-config: 4a2281a4e3ed0cc5e89f07aba3c1bb314ea51c17cb5d2b41e9b045054a6b5c72
                      checksum/webserver-secret-key: a1e18ebcc73a51b6bafe52d95eee84dcdf132559cac0248fff6e58e409b4505e
                      kubernetes.io/psp: eks.privileged
Status:               Running
IP:                   192.168.161.147
IPs:
  IP:           192.168.161.147
Controlled By:  ReplicaSet/airflow-webserver-775d548b98
Init Containers:
  wait-for-airflow-migrations:
    Container ID:  containerd://bf4919f7a268bbeaf1a8f8779e4da1551d76f622d9ce970f18a3f2a1f14c24d7
    Image:         apache/airflow:2.4.1
    Image ID:      docker.io/apache/airflow@sha256:e077b68d81d56d773bddbcdc8941b7a2c16a2087a641005dfc5f1b8dcadec90a
    Port:          <none>
    Host Port:     <none>
    Args:
      airflow
      db
      check-migrations
      --migration-wait-timeout=60
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 13 Oct 2022 17:14:40 -0400
      Finished:     Thu, 13 Oct 2022 17:15:12 -0400
    Ready:          True
    Restart Count:  0
    Environment:
      AIRFLOW__CORE__FERNET_KEY:            <set to the key 'fernet-key' in secret 'airflow-fernet-key'>                      Optional: false
      AIRFLOW__CORE__SQL_ALCHEMY_CONN:      <set to the key 'connection' in secret 'airflow-airflow-metadata'>                Optional: false
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN:  <set to the key 'connection' in secret 'airflow-airflow-metadata'>                Optional: false
      AIRFLOW_CONN_AIRFLOW_DB:              <set to the key 'connection' in secret 'airflow-airflow-metadata'>                Optional: false
      AIRFLOW__WEBSERVER__SECRET_KEY:       <set to the key 'webserver-secret-key' in secret 'airflow-webserver-secret-key'>  Optional: false
    Mounts:
      /opt/airflow/airflow.cfg from config (ro,path="airflow.cfg")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-pntv6 (ro)
Containers:
  webserver:
    Container ID:  containerd://e479b50af8eefc8c99971cc9cc9b6345f826c09d5f770276b33518340298359d
    Image:         apache/airflow:2.4.1
    Image ID:      docker.io/apache/airflow@sha256:e077b68d81d56d773bddbcdc8941b7a2c16a2087a641005dfc5f1b8dcadec90a
    Port:          8080/TCP
    Host Port:     0/TCP
    Args:
      bash
      -c
      exec airflow webserver
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    143
      Started:      Thu, 13 Oct 2022 17:40:25 -0400
      Finished:     Thu, 13 Oct 2022 17:42:19 -0400
    Ready:          False
    Restart Count:  9
    Liveness:       http-get http://:8080/health delay=15s timeout=30s period=5s #success=1 #failure=20
    Readiness:      http-get http://:8080/health delay=15s timeout=30s period=5s #success=1 #failure=20
    Environment:
      AIRFLOW__CORE__FERNET_KEY:            <set to the key 'fernet-key' in secret 'airflow-fernet-key'>                      Optional: false
      AIRFLOW__CORE__SQL_ALCHEMY_CONN:      <set to the key 'connection' in secret 'airflow-airflow-metadata'>                Optional: false
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN:  <set to the key 'connection' in secret 'airflow-airflow-metadata'>                Optional: false
      AIRFLOW_CONN_AIRFLOW_DB:              <set to the key 'connection' in secret 'airflow-airflow-metadata'>                Optional: false
      AIRFLOW__WEBSERVER__SECRET_KEY:       <set to the key 'webserver-secret-key' in secret 'airflow-webserver-secret-key'>  Optional: false
    Mounts:
      /opt/airflow/airflow.cfg from config (ro,path="airflow.cfg")
      /opt/airflow/config/airflow_local_settings.py from config (ro,path="airflow_local_settings.py")
      /opt/airflow/logs from logs (rw)
      /opt/airflow/pod_templates/pod_template_file.yaml from config (ro,path="pod_template_file.yaml")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-pntv6 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      airflow-airflow-config
    Optional:  false
  logs:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  af-efs-fargate-1
    ReadOnly:   false
  kube-api-access-pntv6:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason           Age                  From               Message
  ----     ------           ----                 ----               -------
  Warning  LoggingDisabled  31m                  fargate-scheduler  Disabled logging because aws-logging configmap was not found. configmap "aws-logging" not found
  Normal   Scheduled        30m                  fargate-scheduler  Successfully assigned dev/airflow-webserver-775d548b98-wd5x8 to fargate-ip-192-168-161-147.us-west-2.compute.internal
  Normal   Pulling          30m                  kubelet            Pulling image "apache/airflow:2.4.1"
  Normal   Pulled           28m                  kubelet            Successfully pulled image "apache/airflow:2.4.1" in 1m43.155801441s
  Normal   Created          28m                  kubelet            Created container wait-for-airflow-migrations
  Normal   Started          28m                  kubelet            Started container wait-for-airflow-migrations
  Normal   Pulled           28m                  kubelet            Container image "apache/airflow:2.4.1" already present on machine
  Normal   Created          28m                  kubelet            Created container webserver
  Normal   Started          28m                  kubelet            Started container webserver
  Warning  Unhealthy        27m (x9 over 27m)    kubelet            Readiness probe failed: Get "http://192.168.161.147:8080/health": dial tcp 192.168.161.147:8080: connect: connection refused
  Warning  Unhealthy        10m (x156 over 27m)  kubelet            Liveness probe failed: Get "http://192.168.161.147:8080/health": dial tcp 192.168.161.147:8080: connect: connection refused
  Warning  BackOff          10s (x44 over 14m)   kubelet            Back-off restarting failed container

Any thoughts on why the pods keep restarting?
I would appreciate your help here.
Thanks


  [1]: https://i.stack.imgur.com/IPocP.png

CodePudding user response:

Your host port is 0. I guess that could prevent the webserver from exposing its port. However, you would have to check the logs of the webserver pod itself to confirm that this is the problem.
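
One way to check those logs is sketched below, using standard kubectl subcommands. The helper names `show_crash_logs` and `show_recent_events` are made up for illustration; the container name `webserver` comes from the describe output in the question. The `--previous` flag asks for the logs of the last terminated container instance, which is usually the interesting one while a pod is in CrashLoopBackOff.

```shell
# Hypothetical helper: print logs from the previous (crashed) webserver container.
# $1 = namespace, $2 = pod name
show_crash_logs() {
  kubectl logs -n "$1" "$2" -c webserver --previous
}

# Hypothetical helper: list events in a namespace, sorted by last timestamp.
# $1 = namespace
show_recent_events() {
  kubectl get events -n "$1" --sort-by=.lastTimestamp
}
```

With the names from the question, the call would be `show_crash_logs dev airflow-webserver-775d548b98-wd5x8`.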

You need to make sure that this endpoint is available (currently it is not): http://192.168.161.147:8080/health
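
The numbers in the describe output are consistent with the liveness probe killing the container before the webserver ever starts listening. The arithmetic below is a rough sketch using only values from the question: exit code 143 and the probe settings `delay=15s period=5s #failure=20`. It ignores probe jitter and shutdown time, so treat the result as an approximation.

```shell
# Exit codes above 128 conventionally mean "terminated by signal (code - 128)".
exit_code=143
signal_number=$((exit_code - 128))
signal_name=$(kill -l "$signal_number")   # shell builtin; prints the signal name

# Liveness probe settings copied from the describe output.
initial_delay=15
period=5
failure_threshold=20

# First probe fires after the initial delay; the last allowed failure is
# (failure_threshold - 1) periods later.
earliest_kill=$((initial_delay + period * (failure_threshold - 1)))

echo "exit code $exit_code -> signal $signal_number ($signal_name)"
echo "liveness probe gives up about ${earliest_kill}s after container start"
```

Signal 15 is SIGTERM, which is what the kubelet sends when it restarts a container. The last run in the question started at 17:40:25 and finished at 17:42:19, about 114 seconds, which is close to the computed window. That points at a webserver that is too slow to start rather than one that crashes on its own.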

CodePudding user response:

I ended up increasing the resources for the webserver, and this solved the problem.

Thanks
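
For context, the describe output shows `CapacityProvisioned: 0.25vCPU 0.5GB` and `QoS Class: BestEffort`, meaning the pod declared no resource requests and Fargate gave it its smallest size. A minimal sketch of raising the requests through the chart's `webserver.resources` value follows. The sizes `1` CPU and `2Gi` are illustrative assumptions, since the answer does not say which values were used, and the function name `upgrade_airflow_with_resources` is hypothetical.

```shell
# Illustrative sizes only; the original answer does not state the values used.
WEBSERVER_CPU="1"
WEBSERVER_MEMORY="2Gi"

# Hypothetical helper: re-run the helm command from the question with
# explicit resource requests for the webserver container.
upgrade_airflow_with_resources() {
  helm upgrade --install airflow apache-airflow/airflow -n dev \
    --values values.yaml \
    --set webserver.resources.requests.cpu="$WEBSERVER_CPU" \
    --set webserver.resources.requests.memory="$WEBSERVER_MEMORY"
}
```

After calling `upgrade_airflow_with_resources`, the `CapacityProvisioned` annotation on the new webserver pod should show the larger size that Fargate selected.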
