+++
disableToc = false
title = "AIKit"
description = "AI + BuildKit = AIKit: Build and deploy large language models easily"
weight = 2
+++

GitHub Link - https://github.com/sozercan/aikit

[AIKit](https://github.com/sozercan/aikit) is a quick, easy, and local or cloud-agnostic way to get started hosting and deploying large language models (LLMs) for inference. No GPU, internet access, or additional tools are needed to get started except for [Docker](https://docs.docker.com/desktop/install/linux-install/)!

AIKit uses [LocalAI](https://localai.io/) under the hood to run inference. LocalAI provides a drop-in replacement REST API that is OpenAI API compatible, so you can use any OpenAI API compatible client, such as [Kubectl AI](https://github.com/sozercan/kubectl-ai), [Chatbot-UI](https://github.com/sozercan/chatbot-ui), and many more, to send requests to open-source LLMs powered by AIKit!

> At this time, AIKit is tested with the LocalAI `llama` backend. Other backends may work but are not tested. Please open an issue if you'd like to see support for other backends.
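
Since the API is OpenAI compatible, standard OpenAI-style endpoints work against a running AIKit container. For example, once a model image is running (see [Quick Start](#quick-start) below), you can list the models the server exposes. This is a minimal sketch, assuming the server is published on `localhost:8080` as in the examples later on this page:

```bash
# list the models served by the OpenAI-compatible API
curl http://localhost:8080/v1/models
```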

## Features

- 🐳 No GPU, internet access, or additional tools needed except for [Docker](https://docs.docker.com/desktop/install/linux-install/)!
- 🤏 Minimal image size, resulting in fewer vulnerabilities and a smaller attack surface with a custom [distroless](https://github.com/GoogleContainerTools/distroless)-based image
- 🚀 Easy-to-use declarative configuration
- ✨ OpenAI API compatible, so it works with any OpenAI API compatible client
- 🚢 Kubernetes deployment ready
- 📦 Supports multiple models with a single image
- 🖥️ Supports GPU-accelerated inferencing with NVIDIA GPUs
- 🔐 Signed images for `aikit` and pre-made models

## Pre-made Models

AIKit comes with pre-made models that you can use out of the box!

### CPU

- 🦙 Llama 2 7B Chat: `ghcr.io/sozercan/llama2:7b`
- 🦙 Llama 2 13B Chat: `ghcr.io/sozercan/llama2:13b`
- 🐬 Orca 2 13B: `ghcr.io/sozercan/orca2:13b`

### NVIDIA CUDA

- 🦙 Llama 2 7B Chat (CUDA): `ghcr.io/sozercan/llama2:7b-cuda`
- 🦙 Llama 2 13B Chat (CUDA): `ghcr.io/sozercan/llama2:13b-cuda`
- 🐬 Orca 2 13B (CUDA): `ghcr.io/sozercan/orca2:13b-cuda`

> CUDA models include CUDA v12. They are meant to be used with [NVIDIA GPU acceleration](#gpu-acceleration-support).
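
Because the pre-made models are published as ready-to-run container images, you can start one directly without building anything. For example, to serve the CPU Llama 2 7B Chat image listed above (a minimal sketch; see [Running models](#running-models) below for details):

```bash
# pull and run a pre-made model; the API becomes available on localhost:8080
docker run -d --rm -p 8080:8080 ghcr.io/sozercan/llama2:7b
```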

## Quick Start

### Creating an image

> This section shows how to create a custom image with models of your choosing. If you want to use one of the pre-made models, skip to [running models](#running-models).
>
> Please see the [models folder](./models/) for pre-made model definitions. You can find more model examples at [go-skynet/model-gallery](https://github.com/go-skynet/model-gallery).

Create an `aikitfile.yaml` with the following structure:

```yaml
#syntax=ghcr.io/sozercan/aikit:latest
apiVersion: v1alpha1
models:
  - name: llama-2-7b-chat
    source: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf
```

> This is the simplest way to get started building an image. For the full `aikitfile` specification, see [specs](docs/specs.md).
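
Since `models` is a list, the same pattern extends to bundling several models in a single image, as mentioned in the features above. The sketch below assumes you want to add a second GGUF model; the second model's name and source URL are illustrative and not taken from the official examples:

```yaml
#syntax=ghcr.io/sozercan/aikit:latest
apiVersion: v1alpha1
models:
  - name: llama-2-7b-chat
    source: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf
  # second model entry (illustrative); any additional GGUF source can be listed the same way
  - name: llama-2-13b-chat
    source: https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/resolve/main/llama-2-13b-chat.Q4_K_M.gguf
```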

First, create a BuildKit builder instance with `docker buildx`. Alternatively, if you are using Docker v24 with the [containerd image store](https://docs.docker.com/storage/containerd/) enabled, you can skip this step.

```bash
docker buildx create --use --name aikit-builder
```
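
You can verify the builder exists, and clean it up later, with the usual buildx commands. For example:

```bash
docker buildx ls                 # aikit-builder should be listed and marked as the current builder
docker buildx rm aikit-builder   # remove the builder when you no longer need it
```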

Then build your image with:

```bash
docker buildx build . -t my-model -f aikitfile.yaml --load
```

This will build a local container image with your model(s). You can see the image with:

```bash
docker images
REPOSITORY    TAG      IMAGE ID       CREATED             SIZE
my-model      latest   e7b7c5a4a2cb   About an hour ago   5.51GB
```
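
If you plan to deploy the image to a cluster (see [Kubernetes Deployment](#kubernetes-deployment) below), it needs to be in a registry the cluster can pull from. One way is to push instead of loading locally. This is a sketch that assumes you have push access to a registry; the repository name is illustrative:

```bash
# build and push the image to a registry instead of loading it into the local image store
docker buildx build . -t ghcr.io/example-org/my-model:latest -f aikitfile.yaml --push
```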

### Running models

You can start the inferencing server for your models with:

```bash
# for pre-made models, replace "my-model" with the image name
docker run -d --rm -p 8080:8080 my-model
```

You can then send requests to `localhost:8080` to run inference from your models. For example:

```bash
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "llama-2-7b-chat",
    "messages": [{"role": "user", "content": "explain kubernetes in a sentence"}]
  }'
{"created":1701236489,"object":"chat.completion","id":"dd1ff40b-31a7-4418-9e32-42151ab6875a","model":"llama-2-7b-chat","choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"\nKubernetes is a container orchestration system that automates the deployment, scaling, and management of containerized applications in a microservices architecture."}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
```
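
Since the server was started detached with `--rm`, stopping the container also removes it. A quick sketch for stopping it when you are done, assuming the `my-model` image name from above:

```bash
# stop (and, because of --rm, automatically remove) the container started from the my-model image
docker stop $(docker ps -q --filter ancestor=my-model)
```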

## Kubernetes Deployment

It is easy to get started deploying your models to Kubernetes!

Make sure you have a Kubernetes cluster running, `kubectl` is configured to talk to it, and your model images are accessible from the cluster.

> You can use [kind](https://kind.sigs.k8s.io/) to create a local Kubernetes cluster for testing purposes.
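
For example, with kind you can create a throwaway cluster and make a locally built image available to it without pushing to a registry. A minimal sketch, reusing the `my-model` image built earlier:

```bash
kind create cluster                # create a local test cluster
kind load docker-image my-model    # copy the local image into the kind node(s)
```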

```bash
# create a deployment
# for pre-made models, replace "my-model" with the image name
kubectl create deployment my-llm-deployment --image=my-model

# expose it as a service
kubectl expose deployment my-llm-deployment --port=8080 --target-port=8080 --name=my-llm-service

# easy to scale up and down as needed
kubectl scale deployment my-llm-deployment --replicas=3

# port-forward for testing locally
kubectl port-forward service/my-llm-service 8080:8080

# send requests to your model
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "llama-2-7b-chat",
    "messages": [{"role": "user", "content": "explain kubernetes in a sentence"}]
  }'
{"created":1701236489,"object":"chat.completion","id":"dd1ff40b-31a7-4418-9e32-42151ab6875a","model":"llama-2-7b-chat","choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"\nKubernetes is a container orchestration system that automates the deployment, scaling, and management of containerized applications in a microservices architecture."}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
```

> For an example Kubernetes deployment and service YAML, see the [kubernetes folder](./kubernetes/). Please note that these are examples; you may need to customize them (for instance, with properly configured resource requests and limits) based on your needs.
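
To give a sense of the customization mentioned above, here is a minimal Deployment sketch with resource requests and limits filled in. The numbers are placeholders rather than recommendations, and the names and labels are illustrative; size them for your model:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-llm-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-llm
  template:
    metadata:
      labels:
        app: my-llm
    spec:
      containers:
        - name: my-llm
          image: my-model   # for pre-made models, use the image name, e.g. ghcr.io/sozercan/llama2:7b
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "4"      # placeholder values; adjust for your model size and traffic
              memory: 8Gi
            limits:
              cpu: "8"
              memory: 16Gi
```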

## GPU Acceleration Support

> At this time, only NVIDIA GPU acceleration is supported. Please open an issue if you'd like to see support for other GPU vendors.

### NVIDIA

AIKit supports GPU-accelerated inferencing with the [NVIDIA Container Toolkit](https://github.com/NVIDIA/nvidia-container-toolkit). You must also have [NVIDIA Drivers](https://www.nvidia.com/en-us/drivers/unix/) installed on your host machine.
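
Before running a model, you can confirm that the drivers and container toolkit are set up by running `nvidia-smi` inside a CUDA base container. A quick sketch (the CUDA base image tag is illustrative; use one that matches your driver version):

```bash
# if this prints your GPU(s), the NVIDIA container runtime is working
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```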

For Kubernetes, the [NVIDIA GPU Operator](https://github.com/NVIDIA/gpu-operator) provides a streamlined way to install the NVIDIA drivers and container toolkit and configure your cluster to use GPUs.

To get started with GPU-accelerated inferencing, make sure to set the following in your `aikitfile` and build your model.

```yaml
runtime: cuda     # use NVIDIA CUDA runtime
f16: true         # use float16 precision
gpu_layers: 35    # number of layers to offload to GPU
low_vram: true    # for devices with low VRAM
```

> Make sure to customize these values based on your model and GPU specs.

After building the model, you can run it with the [`--gpus all`](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/docker-specialized.html#gpu-enumeration) flag to enable GPU support:

```bash
# for pre-made models, replace "my-model" with the image name
docker run --rm --gpus all -p 8080:8080 my-model
```

If GPU acceleration is working, you'll see output similar to the following in the debug logs:

```bash
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr ggml_init_cublas: found 1 CUDA devices:
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr Device 0: Tesla T4, compute capability 7.5
...
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr llm_load_tensors: using CUDA for GPU acceleration
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr llm_load_tensors: mem required = 70.41 MB (+ 2048.00 MB per state)
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr llm_load_tensors: offloading 32 repeating layers to GPU
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr llm_load_tensors: offloading non-repeating layers to GPU
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr llm_load_tensors: offloading v cache to GPU
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr llm_load_tensors: offloading k cache to GPU
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr llm_load_tensors: offloaded 35/35 layers to GPU
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr llm_load_tensors: VRAM used: 5869 MB
```