Flask app deployment failed (Liveness probe failed) after "gcloud builds submit ..."

I am new to frontend, backend, and DevOps work, but I urgently need to deploy an app with Kubernetes on Google Cloud Platform (GCP) to provide a service. I started learning by following this series of tutorials:

https://mickeyabhi1999.medium.com/build-and-deploy-a-web-app-with-react-flask-nginx-postgresql-docker-and-google-kubernetes-e586de159a4d
https://medium.com/swlh/build-and-deploy-a-web-app-with-react-flask-nginx-postgresql-docker-and-google-kubernetes-341f3b4de322

The code for the tutorial series is here: https://github.com/abhiChakra/Addition-App

Everything was fine until the last step: using "gcloud builds submit ..." to build the following on a GCP cluster:

1. nginx react service
2. flask wsgi service
3. nginx react deployment
4. flask wsgi deployment

The first three went well and their status is "OK", but the status of the flask wsgi deployment was "Does not have minimum availability", even after many restarts.

Running "kubectl get pods" showed that the status of the flask pod was "CrashLoopBackOff". I then followed the debugging process suggested here: https://containersolutions.github.io/runbooks/posts/kubernetes/crashloopbackoff/

I used "kubectl describe pod flask" to look into the flask pod. The "Exit Code" was 139, and the events included the messages "Liveness probe failed: Get "http://10.24.0.25:8000/health": read tcp 10.24.0.1:55470->10.24.0.25:8000: read: connection reset by peer" and "Readiness probe failed: Get "http://10.24.0.25:8000/ready": read tcp 10.24.0.1:55848->10.24.0.25:8000: read: connection reset by peer".

The complete log:

Name:         flask-676d5dd999-cf6kt
Namespace:    default
Priority:     0
Node:         gke-addition-app-default-pool-89aab4fe-3l1q/10.140.0.3
Start Time:   Thu, 11 Nov 2021 19:06:24 +0800
Labels:       app.kubernetes.io/managed-by=gcp-cloud-build-deploy
              component=flask
              pod-template-hash=676d5dd999
Annotations:  <none>
Status:       Running
IP:           10.24.0.25
IPs:
  IP:           10.24.0.25
Controlled By:  ReplicaSet/flask-676d5dd999
Containers:
  flask:
    Container ID:   containerd://5459b747e1d44046d283a46ec1eebb625be4df712340ff9cf492d5583a4d41d2
    Image:          gcr.io/peerless-garage-330917/addition-app-flask:latest
    Image ID:       gcr.io/peerless-garage-330917/addition-app-flask@sha256:b45d25ffa8a0939825e31dec1a6dfe84f05aaf4a2e9e43d35084783edc76f0de
    Port:           8000/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Fri, 12 Nov 2021 17:24:14 +0800
    Last State:     Terminated
      Reason:       Error
      Exit Code:    139
      Started:      Fri, 12 Nov 2021 17:17:06 +0800
      Finished:     Fri, 12 Nov 2021 17:19:06 +0800
    Ready:          False
    Restart Count:  222
    Limits:
      cpu:  1
    Requests:
      cpu:        400m
    Liveness:     http-get http://:8000/health delay=120s timeout=1s period=5s #success=1 #failure=3
    Readiness:    http-get http://:8000/ready delay=120s timeout=1s period=5s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-s97x5 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  default-token-s97x5:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-s97x5
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                     From     Message
  ----     ------     ----                    ----     -------
  Warning  Unhealthy  9m7s (x217 over 21h)    kubelet  (combined from similar events): Liveness probe failed: Get "http://10.24.0.25:8000/health": read tcp 10.24.0.1:48636->10.24.0.25:8000: read: connection reset by peer
  Warning  BackOff    4m38s (x4404 over 22h)  kubelet  Back-off restarting failed container
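
Two details in the output above can be decoded before changing any probe settings. By the usual shell and container-runtime convention, an exit code above 128 means the process was terminated by signal number (code - 128), and the previous container's Started/Finished timestamps give its lifetime, which can be compared with the probe's delay=120s. The following Python sketch does both; the helper names are illustrative and not part of the app, and the time zone suffix is left out of the timestamps for simplicity:

```python
import signal
from datetime import datetime


def signal_from_exit_code(exit_code):
    """Return the signal name implied by an exit code above 128, else None."""
    if exit_code <= 128:
        return None
    try:
        return signal.Signals(exit_code - 128).name
    except ValueError:
        # Not a signal number known on this platform.
        return None


def lifetime_seconds(started, finished, fmt="%a, %d %b %Y %H:%M:%S"):
    """Seconds between two kubectl-style timestamps (without time zone)."""
    delta = datetime.strptime(finished, fmt) - datetime.strptime(started, fmt)
    return delta.total_seconds()


# Values copied from the "Last State" section of the describe output.
print(signal_from_exit_code(139))
print(lifetime_seconds("Fri, 12 Nov 2021 17:17:06", "Fri, 12 Nov 2021 17:19:06"))
```

This prints SIGSEGV and 120.0. Signal 11 is a segmentation fault, and the lifetime matches the 120 s initial delay, which is consistent with the process crashing on its own around the time the first probe request arrives (a container killed by the kubelet for failing probes would typically exit with 137 or 143 instead). This is an inference from the timestamps, not a confirmed diagnosis.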

Following the suggestion here: https://containersolutions.github.io/runbooks/posts/kubernetes/crashloopbackoff/#step-4 I increased "initialDelaySeconds" to 120, but the deployment still failed.

Since everything worked fine on my local laptop, I suspect a connection or authentication issue.

In more detail, the deployment.yaml looks like this:

apiVersion: v1
kind: Service
metadata:
  name: ui
spec:
  type: LoadBalancer
  selector:
    app: react
    tier: ui
  ports:
    - port: 8080
      targetPort: 8080
---
apiVersion: v1
kind: Service
metadata: 
  name: flask
spec:
  type: ClusterIP
  selector:
    component: flask
  ports:
    - port: 8000
      targetPort: 8000
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flask
spec:
  replicas: 1
  selector:
    matchLabels:
      component: flask
  template:
    metadata:
      labels:
        component: flask
    spec:
      containers:
        - name: flask
          image: gcr.io/peerless-garage-330917/addition-app-flask:latest
          imagePullPolicy: "Always"
          resources:
            limits:
              cpu: "1000m"
            requests:
              cpu: "400m"
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 5
          readinessProbe:
            httpGet:
              path: /ready
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 5
          ports:
            - containerPort: 8000
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ui
spec:
  replicas: 1
  selector:
    matchLabels:
      app: react
      tier: ui
  template:
    metadata:
      labels:
        app: react
        tier: ui
    spec:
      containers:
        - name: ui
          image: gcr.io/peerless-garage-330917/addition-app-nginx:latest
          imagePullPolicy: "Always"
          resources:
            limits:
              cpu: "1000m"
            requests:
              cpu: "400m"
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 5
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 5
          ports:
            - containerPort: 8080

docker-compose.yaml:

# we will be creating these services
services:
  flask:
    # Note that we are building from our current terminal directory where our Dockerfile is located, we use .
    build: . 
    # naming our resulting container
    container_name: flask
    # publishing a port so that external services requesting port 8000 on your local machine
    # are mapped to port 8000 on our container
    ports:
      - "8000:8000"

  nginx: 
    # Since our Dockerfile for web-server is located in react-app folder, our build context is ./react-app
    build: ./react-app
    container_name: nginx
    ports:
      - "8080:8080"

Nginx Dockerfile:

# first building react project, using node base image
FROM node:10 as build-stage

# setting working dir inside container
WORKDIR /react-app

# required to install packages
COPY package*.json ./

# installing npm packages
RUN npm install

# copying over react source material
COPY src ./src

# copying over further react material
COPY public ./public

# copying over our nginx config file
COPY addition_container_server.conf ./

# creating production build to serve through nginx
RUN npm run build

# starting second, nginx build-stage
FROM nginx:1.15

# removing default nginx config file
RUN rm /etc/nginx/conf.d/default.conf

# copying our nginx config
COPY --from=build-stage /react-app/addition_container_server.conf /etc/nginx/conf.d/

# copying production build from last stage to serve through nginx
COPY --from=build-stage /react-app/build/ /usr/share/nginx/html

# exposing port 8080 on container
EXPOSE 8080

CMD ["nginx", "-g", "daemon off;"]

Nginx server config:

server {

    listen 8080;

    # location of react build files
    root /usr/share/nginx/html/;

    # index html from react build to serve
    index index.html;

    # ONLY KUBERNETES RELEVANT: endpoint for health checkup
    location /health {
        return 200 "health ok";
    }

    # ONLY KUBERNETES RELEVANT: endpoint for readiness checkup
    location /ready {
        return 200 "ready";
    }

    # html file to serve with / endpoint
    location / {
            try_files $uri /index.html;
    }
    
    # proxying under /api endpoint
    location /api {
            client_max_body_size 10m;
            add_header 'Access-Control-Allow-Origin' http://<NGINX_SERVICE_ENDPOINT>:8080;
            proxy_pass http://flask:8000/;
    }
}

There are two important functions in App.js:

...
insertCalculation(event, calculation){
  /*
    Making a POST request via a fetch call to Flask API with numbers of a
    calculation we want to insert into DB. Making fetch call to web server
    IP with /api/insert_nums which will be reverse proxied via Nginx to the
    Application (Flask) server.
  */
    event.preventDefault();

    fetch('http://<NGINX_SERVICE_ENDPOINT>:8080/api/insert_nums', {method: 'POST',
                                                    mode: 'cors',
                                                    headers: {
                                                    'Content-Type' : 'application/json'
                                                    },
                                                    body: JSON.stringify(calculation)}
     ).then((response) => {
...
getHistory(event){
    /*
        Making a GET request via a fetch call to Flask API to retrieve calculations history.
    */

    event.preventDefault()

    fetch('http://<NGINX_SERVICE_ENDPOINT>:8080/api/data', {method: 'GET',
                                             mode: 'cors'
                                          }
    ).then(response => {
...

Flask Dockerfile:

# using base image
FROM python:3.8

# setting working dir inside container
WORKDIR /addition_app_flask

# adding run.py to workdir
ADD run.py .

# adding config.ini to workdir
ADD config.ini .

# adding requirements.txt to workdir
ADD requirements.txt .

# installing flask requirements
RUN pip install -r requirements.txt

# adding in all contents from flask_app folder into a new flask_app folder
ADD ./flask_app ./flask_app

# exposing port 8000 on container
EXPOSE 8000

# serving flask backend through the gevent WSGI server started in run.py
CMD [ "python", "run.py" ]

run.py:

from gevent.pywsgi import WSGIServer
from flask_app.app import app

# As Flask's built-in server is not suitable for production, we will use
# a WSGIServer instance to serve our Flask application.
if __name__ == '__main__':  
    WSGIServer(('0.0.0.0', 8000), app).serve_forever()

app.py:

from flask import Flask, request, jsonify
from flask_app.storage import insert_calculation, get_calculations

app = Flask(__name__)

@app.route('/')
def index():
    return "My Addition App", 200

@app.route('/health')
def health():
    return '', 200

@app.route('/ready')
def ready():
    return '', 200

@app.route('/data', methods=['GET'])
def data():
    '''
        Function used to get calculations history
        from Postgres database and return to fetch call in frontend.
    :return: Json format of either collected calculations or error message
    '''

    calculations_history = []

    try:
        calculations = get_calculations()
        for key, value in calculations.items():
            calculations_history.append(value)
    
        return jsonify({'calculations': calculations_history}), 200
    except:
        return jsonify({'error': 'error fetching calculations history'}), 500

@app.route('/insert_nums', methods=['POST'])
def insert_nums():
    '''
        Function used to insert a calculation into our postgres
        DB. Operands of operation received from frontend.
    :return: Json format of either success or failure response.
    '''

    insert_nums = request.get_json()
    firstNum, secondNum, answer = insert_nums['firstNum'], insert_nums['secondNum'], insert_nums['answer']

    try:
        insert_calculation(firstNum, secondNum, answer)
        return jsonify({'Response': 'Successfully inserted into DB'}), 200
    except:
        return jsonify({'Response': 'Unable to insert into DB'}), 500

I can't tell what is going wrong. I also wonder what a better way to debug such a cloud deployment would be. In a normal program we can set breakpoints and print or log something to locate the code that causes the problem; in a cloud deployment, however, I have lost my sense of where to start debugging.

Thanks to everyone who has read this! Any comment is appreciated!

CodePudding user response：

...Exit Code was 139...

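Exit code 139 usually means the process was terminated by a segmentation fault (SIGSEGV), so this could mean there's a bug in your Flask app or in one of its native dependencies. You can start with a minimal spec instead of trying to do everything in one go: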

apiVersion: v1
kind: Pod
metadata:
  name: flask
  labels:
    component: flask
spec:
  containers:
  - name: flask
    image: gcr.io/peerless-garage-330917/addition-app-flask:latest
    ports:
    - containerPort: 8000

See if your pod starts correctly. If it does, try connecting to it with kubectl port-forward <flask pod name> 8000:8000, followed by curl localhost:8000/health. You should watch your application's logs the whole time with kubectl logs -f <flask pod name>.
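
The curl check can also be scripted so that it uses the same short timeout as the probes (the kubectl describe output in the question shows timeout=1s). Below is a minimal sketch using only the Python standard library, assuming the port-forward is already running on localhost:8000; check_endpoint is an illustrative helper name, not part of the app. It treats any response that urllib does not raise on as success, which roughly mirrors the kubelet rule that status codes from 200 to 399 pass.

```python
import http.client
import urllib.error
import urllib.request


def check_endpoint(url, timeout_s=1.0):
    """Return (ok, detail) for a single HTTP GET with a short timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return True, f"HTTP {response.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx/5xx status.
        return False, f"HTTP {exc.code}"
    except (OSError, http.client.HTTPException) as exc:
        # Connection refused or reset, timeout, malformed response, ...
        return False, f"{type(exc).__name__}: {exc}"


if __name__ == "__main__":
    for path in ("/health", "/ready"):
        ok, detail = check_endpoint(f"http://localhost:8000{path}")
        print(path, "OK" if ok else "FAILED", detail)
```

A "connection reset" result here, as opposed to an HTTP status, suggests that the server process accepted the connection and then died or closed it before sending a response.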

CodePudding user response：

Thanks to @gohm'c for the response! It is a good suggestion to isolate the different parts and start from a smaller component. As suggested, I tried deploying a single flask pod first. Then I used

kubectl port-forward flask 8000:8000

to map the port to my local machine. After I ran curl localhost:8000/health against the port, the port-forward session showed

Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
Handling connection for 8000
E1112 18:52:15.874759  300145 portforward.go:400] an error occurred forwarding 8000 -> 8000: error forwarding port 8000 to pod 4870b939f3224f968fd5afa4660a5af7d10e144ee85149d69acff46a772e94b1, uid : failed to execute portforward in network namespace "/var/run/netns/cni-32f718f0-1248-6da4-c726-b2a5bf1918db": read tcp4 127.0.0.1:38662->127.0.0.1:8000: read: connection reset by peer

At this point, running

kubectl logs -f flask

returned an empty response. So there is indeed some issue in the flask app.
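
kubectl logs can only show what the container wrote to stdout or stderr, so an empty result means nothing was written before the process died (or that the output belongs to a previous container instance, which kubectl logs --previous can show). One way to get more signal is to log each request before the application handles it, so that a crash during the very first request still leaves a trace. A minimal standard-library sketch follows; RequestLogger is a hypothetical wrapper, not something from the tutorial, and it assumes the Flask app is handed to the server as a plain WSGI callable, as run.py does:

```python
import logging
import sys

# Log to stderr; the stream handler flushes after every record.
logging.basicConfig(stream=sys.stderr, level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("requests")


class RequestLogger:
    """WSGI middleware that logs each request before and after the wrapped app."""

    def __init__(self, wsgi_app):
        self.wsgi_app = wsgi_app

    def __call__(self, environ, start_response):
        method = environ.get("REQUEST_METHOD", "?")
        path = environ.get("PATH_INFO", "?")
        # Written before the app runs, so it survives a crash inside the app.
        log.info("start %s %s", method, path)

        def logging_start_response(status, headers, exc_info=None):
            log.info("done  %s %s -> %s", method, path, status)
            return start_response(status, headers, exc_info)

        return self.wsgi_app(environ, logging_start_response)
```

In run.py this would be used as WSGIServer(('0.0.0.0', 8000), RequestLogger(app)). If the logs then show a "start" line without a matching "done" line, the crash happens while the request is being handled; if they show nothing at all, the request never reached the application code.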

The health probe endpoint is a really simple function in app.py:

@app.route('/health')
def health():
    return '', 200

How can I tell whether the route is set up incorrectly? Or is the problem caused by the WSGIServer in run.py?

from gevent.pywsgi import WSGIServer
from flask_app.app import app

# As Flask's built-in server is not suitable for production, we will use
# a WSGIServer instance to serve our Flask application.
if __name__ == '__main__':  
    WSGIServer(('0.0.0.0', 8000), app).serve_forever()

The Dockerfile appears to expose the correct port, 8000. If I directly run

python run.py

on my laptop, I can successfully access localhost:8000. How can I debug this kind of problem?
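
Since the container exits with code 139, one low-cost next step is Python's built-in faulthandler module, which writes the Python traceback to a file descriptor when the interpreter receives a fatal signal such as SIGSEGV. The sketch below is a suggestion under that assumption, not a confirmed fix; enable_crash_diagnostics is an illustrative name:

```python
import faulthandler
import sys


def enable_crash_diagnostics(stream=None):
    """Dump the tracebacks of all threads to `stream` on a fatal signal.

    `stream` must be a real file with a file descriptor; it defaults to
    stderr, which is what `kubectl logs` displays for a container.
    """
    target = stream if stream is not None else sys.stderr
    faulthandler.enable(file=target, all_threads=True)
    return faulthandler.is_enabled()
```

In run.py the call would go at the very top, before "from flask_app.app import app", so that a crash during import is covered as well. The same effect is available without code changes by starting the interpreter with python -X faulthandler run.py or by setting the environment variable PYTHONFAULTHANDLER=1 in the container. The traceback should then show which Python call was active when the segmentation fault occurred, which narrows down whether the server, a route, or an imported library is involved.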