
APIs temporarily encounter the error (image pull) during scale up #1989

@vishalbollu

Description


When the cluster is scaling up, many images are downloaded per instance (logging, monitoring, k8s management, and at least 2-3 images per API replica). The kubelet's default image pull rate limit (registry-qps) is 5 QPS per node, which can be hit fairly easily during node initialization. Anecdotally, it took about 20 seconds for the kubelet's retries to pull the images successfully. This can slow down scale-ups, especially as the number of images per pod and per node increases.

When a pod of an API encounters this issue, the status of the API is calculated to be error (image pull), which makes it indistinguishable from the case where an image actually cannot be pulled.

The following actions can be taken:

  • when determining the status of the pod, don't set the status to error (image pull) if the pull failed because the pod hit the kubelet's image pull QPS limit
  • increase the kubelet's registry-qps setting using eksctl's kubeletExtraConfig
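
The first action could look roughly like the sketch below. It is not Cortex's actual status code: `ContainerWaiting`, `isImagePullThrottled`, `podStatus`, and the status strings are hypothetical names used for illustration, and the reason values `ErrImagePull` and `ImagePullBackOff` are assumed to be what Kubernetes reports for the waiting container. The only detail taken from this issue is the kubelet message `pull QPS exceeded` seen in the events below.

```go
package main

import (
	"fmt"
	"strings"
)

// ContainerWaiting is a hypothetical, simplified stand-in for the waiting
// state Kubernetes reports for a container (a reason plus a message).
type ContainerWaiting struct {
	Reason  string
	Message string
}

// qpsExceededMsg is the kubelet message seen in the events in this issue.
const qpsExceededMsg = "pull QPS exceeded"

// isImagePullThrottled reports whether a pull failure looks like it was
// caused by the kubelet's registry QPS limit rather than by an image that
// is missing or inaccessible.
func isImagePullThrottled(w ContainerWaiting) bool {
	return strings.Contains(w.Message, qpsExceededMsg)
}

// podStatus returns a hypothetical status string. Throttled pulls are
// treated as transient, so the pod stays "pending" instead of being
// reported as "error (image pull)".
func podStatus(waiting []ContainerWaiting) string {
	for _, w := range waiting {
		switch w.Reason {
		case "ErrImagePull", "ImagePullBackOff": // assumed reason strings
			if isImagePullThrottled(w) {
				continue // the kubelet will retry; not a real pull error
			}
			return "error (image pull)"
		}
	}
	return "pending"
}

func main() {
	throttled := []ContainerWaiting{{Reason: "ErrImagePull", Message: "pull QPS exceeded"}}
	missing := []ContainerWaiting{{Reason: "ErrImagePull", Message: "manifest unknown"}}
	fmt.Println(podStatus(throttled))
	fmt.Println(podStatus(missing))
}
```

One caveat with this approach: once the container moves into the back-off state, the waiting message may no longer contain the original error text, so a real implementation might also need to inspect the pod's events.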

Additional Context

Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  2m27s                  default-scheduler  Successfully assigned default/api
  Normal   Pulling    2m21s                  kubelet            Pulling image "quay.io/cortexlabs/downloader:0.23.0"
  Normal   Pulled     2m21s                  kubelet            Successfully pulled image "quay.io/cortexlabs/downloader:0.23.0"
  Normal   Created    2m21s                  kubelet            Created container downloader
  Normal   Started    2m21s                  kubelet            Started container downloader
  Warning  Failed     2m19s                  kubelet            Failed to pull image "quay.io/cortexlabs/request-monitor:0.23.0": pull QPS exceeded
  Warning  Failed     2m19s                  kubelet            Failed to pull image "quay.io/cortexlabs/python-predictor-cpu:0.23.0": pull QPS exceeded

Labels

enhancement: New feature or request
