APIs temporarily encounter the error (image pull) during scale up #1989

### Description

When the cluster is scaling up, many images are downloaded per instance (logging, monitoring, k8s management, and at least 2-3 images per API replica). The kubelet's default image pull rate limit is 5 QPS per node, which can be hit fairly easily during node initialization. Anecdotally, it took about 20 seconds for the kubelet's retries to successfully pull the images. This can slow down scale-ups, especially as the number of images per pod and per node increases.
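To illustrate why the limit is easy to hit, here is a rough simulation of a token-bucket limiter. It assumes the kubelet throttles pulls with a standard token bucket using its default settings (`registryPullQPS: 5`, `registryBurst: 10`); the class and function names are made up for this sketch, and the timing of real pull attempts on a node will differ.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Toy token bucket; a stand-in for the kubelet's limiter, not its code."""
    qps: float          # tokens refilled per second
    burst: int          # bucket capacity
    tokens: float = 0.0
    last: float = 0.0   # timestamp of the previous attempt, in seconds

    def __post_init__(self) -> None:
        self.tokens = float(self.burst)  # the bucket starts full

    def try_accept(self, now: float) -> bool:
        # Refill for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.qps)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # corresponds to "pull QPS exceeded"


def simulate(pull_times, qps=5.0, burst=10):
    """Return one accept/reject flag per pull attempt (times must be sorted)."""
    bucket = TokenBucket(qps=qps, burst=burst)
    return [bucket.try_accept(t) for t in pull_times]


# Hypothetical node start: 15 pull attempts arrive at the same instant.
results = simulate([0.0] * 15)
print(f"accepted={results.count(True)} rejected={results.count(False)}")
```

Under these assumptions, any attempt beyond the burst size is rejected until the bucket refills, so a node starting many containers at once would see some pulls fail and get retried.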

When a pod of an API encounters this issue, the API's status is calculated to be `error (image pull)`, which makes it indistinguishable from the case where an image actually cannot be pulled.

The following actions can be taken:
- [ ] When determining the status of the pod, don't set it to `error (image pull)` if the pod hit the QPS limit
- [ ] Increase the `registry-qps` setting of the kubelet using [eksctl kubeletExtraConfig](https://eksctl.io/usage/customizing-the-kubelet/)
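The first item could look roughly like the sketch below. It is not the operator's actual status logic: `classify_image_pull` and the `pending (image pull throttled)` status name are hypothetical, and the check relies on the `pull QPS exceeded` text seen in the events under Additional Context. Whether that text is still present once the container moves into `ImagePullBackOff` is not confirmed here, so treat the back-off case as an open question.

```python
# Marker taken from the kubelet events under "Additional Context".
QPS_EXCEEDED_MARKER = "pull QPS exceeded"

# Container waiting reasons the kubelet reports for image pull problems.
IMAGE_PULL_REASONS = {"ErrImagePull", "ImagePullBackOff"}


def classify_image_pull(reason: str, message: str) -> str:
    """Map a container's waiting reason/message to a status string.

    Hypothetical helper: the returned names other than
    "error (image pull)" are invented for this sketch.
    """
    if reason not in IMAGE_PULL_REASONS:
        return "not an image pull problem"
    if QPS_EXCEEDED_MARKER in message:
        # Throttled by the node's pull rate limit; the kubelet will retry.
        return "pending (image pull throttled)"
    return "error (image pull)"


print(classify_image_pull(
    "ErrImagePull",
    'Failed to pull image "quay.io/cortexlabs/request-monitor:0.23.0": pull QPS exceeded',
))
```

Matching on a message substring is fragile, but it keeps real pull failures (bad image name, missing credentials) reported as errors while throttled pulls are reported separately.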

### Additional Context

```
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  2m27s                  default-scheduler  Successfully assigned default/api
  Normal   Pulling    2m21s                  kubelet            Pulling image "quay.io/cortexlabs/downloader:0.23.0"
  Normal   Pulled     2m21s                  kubelet            Successfully pulled image "quay.io/cortexlabs/downloader:0.23.0"
  Normal   Created    2m21s                  kubelet            Created container downloader
  Normal   Started    2m21s                  kubelet            Started container downloader
  Warning  Failed     2m19s                  kubelet            Failed to pull image "quay.io/cortexlabs/request-monitor:0.23.0": pull QPS exceeded
  Warning  Failed     2m19s                  kubelet            Failed to pull image "quay.io/cortexlabs/python-predictor-cpu:0.23.0": pull QPS exceeded
```

