Closed
Description
When the cluster is scaling up, many images are downloaded per instance (logging, monitoring, k8s management, and at least 2-3 images per API replica). The kubelet's default image pull rate limit is 5 QPS per node, which is easy to hit during node initialization. Anecdotally, it took about 20 seconds for the kubelet's retries to successfully pull the images. This can slow down scale-ups, especially as the number of images per pod and per node increases.
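To make the throttling behaviour concrete, here is a minimal sketch of a token-bucket limiter that rejects pulls once its tokens run out. It is a hand-rolled illustration, not the kubelet's code. The type and function names (`tokenBucket`, `countRejected`) are hypothetical, and the burst of 10 and the 15 simultaneous pulls are assumed values; the issue itself only states the 5 QPS figure.

```go
package main

import "fmt"

// tokenBucket is an illustrative limiter, not the kubelet's implementation.
type tokenBucket struct {
	qps    float64 // tokens added per second
	burst  float64 // maximum number of tokens the bucket can hold
	tokens float64 // tokens currently available
	last   float64 // time of the last refill, in seconds
}

// newTokenBucket returns a bucket that starts full at time 0.
func newTokenBucket(qps, burst float64) *tokenBucket {
	return &tokenBucket{qps: qps, burst: burst, tokens: burst}
}

// tryAccept refills the bucket up to time now (in seconds), then takes one
// token if one is available. It never blocks; a false result models a
// rejected pull.
func (b *tokenBucket) tryAccept(now float64) bool {
	b.tokens += (now - b.last) * b.qps
	if b.tokens > b.burst {
		b.tokens = b.burst
	}
	b.last = now
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

// countRejected returns how many of n pulls issued at the same instant
// would be rejected by a freshly created bucket.
func countRejected(qps, burst float64, n int) int {
	b := newTokenBucket(qps, burst)
	rejected := 0
	for i := 0; i < n; i++ {
		if !b.tryAccept(0) {
			rejected++
		}
	}
	return rejected
}

func main() {
	// Assumed values: 5 QPS, burst of 10, 15 simultaneous pulls.
	fmt.Println(countRejected(5, 10, 15)) // prints 5
}
```

Under these assumptions, any pull beyond the burst fails immediately and must wait for a retry, which is consistent with the delay described above.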
When a pod of an API encounters this issue, the API's status is calculated as error (image pull), which makes it indistinguishable from the case where an image genuinely cannot be pulled.
The following actions can be taken:
- When determining the status of the pod, don't set it to error (image pull) if the pod hit the QPS limit
- Increase the kubelet's registry-qps setting using eksctl kubeletExtraConfig
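The first action could be sketched roughly as follows: check the failure message for the "pull QPS exceeded" text that appears in the events in this issue, and only report error (image pull) for other failures. This is a sketch under assumptions, not the operator's actual code. The function names and the `pending` status are hypothetical, and the second message in `main` is an invented example of a genuine pull failure.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical status labels; the real status names may differ.
const (
	statusImagePullError = "error (image pull)"
	statusPending        = "pending"
)

// qpsExceededMessage is the text the kubelet reports in the events shown
// in this issue when a pull is throttled.
const qpsExceededMessage = "pull QPS exceeded"

// isPullThrottled reports whether a pull failure message indicates
// throttling rather than a genuine failure to pull the image.
func isPullThrottled(message string) bool {
	return strings.Contains(message, qpsExceededMessage)
}

// statusForPullFailure maps a pull failure message to a status. Throttled
// pulls are treated as still pending (an assumption) because the kubelet
// is expected to retry them.
func statusForPullFailure(message string) string {
	if isPullThrottled(message) {
		return statusPending
	}
	return statusImagePullError
}

func main() {
	// Message taken from the events in this issue.
	throttled := `Failed to pull image "quay.io/cortexlabs/request-monitor:0.23.0": pull QPS exceeded`
	// Hypothetical message standing in for a genuine pull failure.
	missing := `Failed to pull image "example.com/missing:latest": manifest unknown`

	fmt.Println(statusForPullFailure(throttled)) // prints pending
	fmt.Println(statusForPullFailure(missing))   // prints error (image pull)
}
```

Matching on the message string is brittle if the kubelet's wording changes, so a real implementation would likely want a test that pins the expected text.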
Additional Context
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 2m27s default-scheduler Successfully assigned default/api
Normal Pulling 2m21s kubelet Pulling image "quay.io/cortexlabs/downloader:0.23.0"
Normal Pulled 2m21s kubelet Successfully pulled image "quay.io/cortexlabs/downloader:0.23.0"
Normal Created 2m21s kubelet Created container downloader
Normal Started 2m21s kubelet Started container downloader
Warning Failed 2m19s kubelet Failed to pull image "quay.io/cortexlabs/request-monitor:0.23.0": pull QPS exceeded
Warning Failed 2m19s kubelet Failed to pull image "quay.io/cortexlabs:python-predictor-cpu:0.23.0": pull QPS exceeded
Metadata
Labels
enhancement