Running Pod takes a long time for internal service to be accessible


I have implemented a gRPC service, built it into a container image, and deployed it with Kubernetes (AWS EKS, specifically) as a DaemonSet.

The Pod starts and reaches Running status quickly, but it takes very long, typically around 300s, before the actual service is accessible.

In fact, when I run kubectl logs on the Pod, the output stays empty for a long time.

I log something at the very start of the service. My code looks like:

package main

import "log"

func init() {
    log.Println("init")
}

func main() {
    // ...
}

So I am fairly sure that as long as there are no logs, the service has not started yet.

I understand there may be a gap between the Pod being Running and the actual process inside it running. However, 300s seems far too long to me.

Furthermore, this happens randomly; sometimes the service is ready almost immediately. By the way, my runtime image is based on chromedp headless-shell; I am not sure whether that is relevant.

Could anyone offer advice on how to debug and locate the problem? Many thanks!


Update

I did not set any readiness probes.

Running kubectl get -o yaml on my DaemonSet gives:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "1"
  creationTimestamp: "2021-10-13T06:30:16Z"
  generation: 1
  labels:
    app: worker
    uuid: worker
  name: worker
  namespace: collection-14f45957-e268-4719-88c3-50b533b0ae66
  resourceVersion: "47265945"
  uid: 88e4671f-9e33-43ef-9c49-b491dcb578e4
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: worker
      uuid: worker
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "2112"
        prometheus.io/scrape: "true"
      creationTimestamp: null
      labels:
        app: worker
        uuid: worker
    spec:
      containers:
      - env:
        - name: GRPC_PORT
          value: "22345"
        - name: DEBUG
          value: "false"
        - name: TARGET
          value: localhost:12345
        - name: TRACKER
          value: 10.100.255.31:12345
        - name: MONITOR
          value: 10.100.125.35:12345
        - name: COLLECTABLE_METHODS
          value: shopping.ShoppingService.GetShop
        - name: POD_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        - name: DISTRIBUTABLE_METHODS
          value: collection.CollectionService.EnumerateShops
        - name: PERFORM_TASK_INTERVAL
          value: 0.000000s
        image: xxx
        imagePullPolicy: Always
        name: worker
        ports:
        - containerPort: 22345
          protocol: TCP
        resources:
          requests:
            cpu: 1800m
            memory: 1Gi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      - env:
        - name: CAPTCHA_PARALLEL
          value: "32"
        - name: HTTP_PROXY
          value: http://10.100.215.25:8080
        - name: HTTPS_PROXY
          value: http://10.100.215.25:8080
        - name: API
          value: 10.100.111.11:12345
        - name: NO_PROXY
          value: 10.100.111.11:12345
        - name: POD_IP
        image: xxx
        imagePullPolicy: Always
        name: source
        ports:
        - containerPort: 12345
          protocol: TCP
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/ssl/certs/api.crt
          name: ca
          readOnly: true
          subPath: tls.crt
      dnsPolicy: ClusterFirst
      nodeSelector:
        api/nodegroup-app: worker
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - name: ca
        secret:
          defaultMode: 420
          secretName: ca
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
status:
  currentNumberScheduled: 2
  desiredNumberScheduled: 2
  numberAvailable: 2
  numberMisscheduled: 0
  numberReady: 2
  observedGeneration: 1
  updatedNumberScheduled: 2

Furthermore, there are two containers in the Pod. Only one of them is exceptionally slow to start; the other is always fine.

CodePudding user response:

When you use HTTP_PROXY in your solution, watch out for how it may route traffic differently from your underlying cluster network, which often results in unexpected timeouts.
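
For instance, in-cluster destinations can be exempted from the proxy via NO_PROXY. Below is a minimal sketch against the "source" container's env from the manifest above; the CIDR and DNS suffixes are assumptions to adapt to your cluster (Go's standard proxy handling honors CIDR entries in NO_PROXY, but other runtimes may require listing hosts individually):

        - name: HTTP_PROXY
          value: http://10.100.215.25:8080
        - name: HTTPS_PROXY
          value: http://10.100.215.25:8080
        - name: NO_PROXY
          # Bypass the proxy for local and in-cluster destinations.
          # 10.100.0.0/16 and the DNS suffixes are assumed values; adjust
          # them to your cluster's service and pod networks.
          value: "localhost,127.0.0.1,10.100.0.0/16,.svc,.cluster.local"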

CodePudding user response:

I have posted a community wiki answer to summarize the topic.

As gohm'c mentioned in a comment:

Do connections made by container "source" always have to go through HTTP_PROXY, even when connecting to services in the cluster? Do you think the long delay could be caused by the proxy? You can try kubectl exec -it <pod> -c <source> -- sh and curl/wget external services.

This is a good observation. Some connections could be made directly, and routing extra traffic through the proxy may introduce delays; for example, the proxy can become a bottleneck. You can read more about using an HTTP proxy to access the Kubernetes API in the documentation.

Additionally, you can create readiness probes to know when a container is ready to start accepting traffic.

A Pod is considered ready when all of its containers are ready. One use of this signal is to control which Pods are used as backends for Services. When a Pod is not ready, it is removed from Service load balancers.
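
For example, here is a minimal readinessProbe sketch for the worker container, assuming the gRPC server listens on the GRPC_PORT value 22345 from the manifest above. A plain TCP check is used because it needs no gRPC health service; on Kubernetes 1.24+ a native grpc probe type is also available:

        readinessProbe:
          # Marks the container Ready only once the port accepts
          # connections, so the slow-start window no longer counts
          # as "ready".
          tcpSocket:
            port: 22345
          initialDelaySeconds: 5
          periodSeconds: 10

With this in place, kubectl get pods will show the Pod as not ready during the slow start, making the gap visible instead of silent.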

The kubelet uses startup probes to know when a container application has started. If such a probe is configured, it disables liveness and readiness checks until it succeeds, making sure those probes don't interfere with the application startup. This can be used to adopt liveness checks on slow starting containers, avoiding them getting killed by the kubelet before they are up and running.
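
A sketch sized to the ~300s worst case reported above; the port matches the manifest, while the thresholds are assumptions to tune:

        startupProbe:
          tcpSocket:
            port: 22345
          # 30 attempts x 10s = up to 300s allowed for the slow starts
          # described in the question before the kubelet gives up.
          failureThreshold: 30
          periodSeconds: 10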
