Model Manager consists of the following components:

- `server`: Serve gRPC requests and HTTP requests.
- `loader`: Load open models to the system.

`loader` currently loads open models from Hugging Face, but we can extend that to support other locations.
Currently we have two flows to create new base models:

- Specify base models to be loaded in the configuration of `model-manager-loader`.
- Send `CreateModel` RPC calls to `model-manager-server`.

We initially had the former flow and later introduced the latter.
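For illustration, here is a minimal sketch of both flows. The `baseModels` key and the grpcurl invocation mirror the standalone examples later in this README; the file name `loader-config.yaml`, the model name, and the tenant ID are placeholders:

```bash
# Flow 1: declare base models in the model-manager-loader configuration
# (appended to a placeholder config file).
cat << EOF >> loader-config.yaml
baseModels:
- google/gemma-2b
EOF

# Flow 2: send a CreateModel RPC to model-manager-server's internal gRPC API.
grpcurl -d '{"base_model": "google/gemma-2b", "suffix": "suffix", "tenant_id": "fake-tenant-id"}' \
  -plaintext localhost:8082 llmariner.models.server.v1.ModelsInternalService/CreateModel
```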
When receiving a `CreateModel` request, `model-manager-server` and `model-manager-loader` take the following steps:

1. `model-manager-server` creates a new base model in the database. The model's loading status is `REQUESTED`.
2. `model-manager-loader` lists models in the `REQUESTED` loading status from `model-manager-server`. When there is such a model, it starts downloading the model files and uploads them to an object store.
3. Once the upload completes, `model-manager-loader` makes a `CreateBaseModel` RPC call to `model-manager-server`.
4. `model-manager-server` receives the `CreateBaseModel` RPC call and creates a base model in the database. Please note that the base model ID at this step can be different from the ID used in the original request, as we convert "/" to "-" (e.g., `openai/whisper-large` to `openai-whisper-large`).
5. `model-manager-loader` makes an `UpdateModelLoadingStatus` RPC call to `model-manager-server`. The original base model is deleted from the database.
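Since these steps run asynchronously, one way to watch the flow complete end to end is to poll the models endpoint until the new model appears. This is a minimal sketch, assuming the HTTP endpoint from the Docker Compose setup below, that the response uses the OpenAI-compatible list format with a `data` array, and that `jq` is installed; the model name is a placeholder:

```bash
# Poll until a model whose ID contains "whisper" shows up.
until curl -s http://localhost:8080/v1/models | jq -e '.data[] | select(.id | contains("whisper"))' > /dev/null; do
  echo "model not loaded yet; retrying in 10 seconds..."
  sleep 10
done
echo "model is available"
```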
To run everything with Docker Compose, run the following commands:

```bash
docker-compose build
docker-compose up
```
You can then access the database or hit the HTTP endpoint:

```bash
docker exec -it <postgres container ID> psql -h localhost -U user --no-password -p 5432 -d model_manager

curl http://localhost:8080/v1/models
```
You can also inspect the object store (MinIO):

```bash
docker exec -it <aws-cli container ID> bash
export AWS_ACCESS_KEY_ID=llmariner-key
export AWS_SECRET_ACCESS_KEY=llmariner-secret
aws --endpoint-url http://minio:9000 s3 ls s3://llmariner
```
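As a quick sanity check, you can also run a query non-interactively through `psql`. This assumes the `models` table used in the standalone SQLite example later in this README; the actual schema may differ:

```bash
# List the registered models in the Postgres database.
docker exec -it <postgres container ID> psql -h localhost -U user --no-password -p 5432 -d model_manager \
  -c 'select * from models;'
```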
To run `server` standalone with a local SQLite database, run:

```bash
make build-server
./bin/server run --config config.yaml
```

`config.yaml` has the following content:
```yaml
httpPort: 8080
grpcPort: 8081
internalGrpcPort: 8082

objectStore:
  s3:
    pathPrefix: models

debug:
  standalone: true
  sqlitePath: /tmp/model_manager.db
```
You can then connect to the DB:

```bash
sqlite3 /tmp/model_manager.db
```

Run the following query inside the database:

```sql
insert into models
  (model_id, tenant_id, created_at, updated_at)
values
  ('my-model', 'fake-tenant-id', CURRENT_TIMESTAMP, CURRENT_TIMESTAMP);
```
You can then hit the endpoint:

```bash
curl http://localhost:8080/v1/models

grpcurl -d '{"base_model": "base", "suffix": "suffix", "tenant_id": "fake-tenant-id"}' -plaintext localhost:8082 llmariner.models.server.v1.ModelsInternalService/CreateModel
```
To run `loader` standalone, run the following commands. Please note that it is better to run this on an EC2 instance as it requires downloading and uploading large files.

```bash
python3 -m venv ./venv
source ./venv/bin/activate
pip install -U "huggingface_hub[cli]"

export AWS_PROFILE=<profile that has access to the bucket>
export HUGGING_FACE_HUB_TOKEN=<Hugging Face API key>

make build-loader
```
```bash
cat << EOF > config.yaml
objectStore:
  s3:
    endpointUrl: https://s3.us-west-2.amazonaws.com
    region: us-west-2
    bucket: llm-operator-models
    pathPrefix: v1
    baseModelPathPrefix: base-models
baseModels:
- google/gemma-2b
runOnce: true
downloader:
  kind: huggingFace
  huggingFace:
    # Change this to your cache directory.
    cacheDir: /home/ubuntu/.cache/huggingface/hub
debug:
  standalone: true
EOF
```
```bash
./bin/loader run --config config.yaml
```
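Once the run finishes, you can verify that the model files were uploaded under the configured prefix. This is a sketch; the exact key layout under `v1/base-models` may differ:

```bash
# List the uploaded files for the google/gemma-2b base model.
aws s3 ls --recursive s3://llm-operator-models/v1/base-models/ | grep gemma-2b
```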
A Hugging Face repository might not contain a GGUF file. If so, run the following commands to convert the model:

```bash
pip install numpy
pip install torch
pip install sentencepiece
pip install safetensors
pip install transformers

MODEL_NAME=meta-llama/Meta-Llama-3-8B-Instruct

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir hf-model-dir
huggingface-cli download "${MODEL_NAME}" --local-dir=hf-model-dir
python3 convert_hf_to_gguf.py --outtype=f32 ./hf-model-dir --outfile model.gguf
mv model.gguf hf-model-dir/
aws s3 cp --recursive ./hf-model-dir s3://llm-operator-models/v1/base-models/"${MODEL_NAME}"
```
See ggml-org/llama.cpp#2948 and https://github.com/ollama/ollama/blob/main/docs/import.md.
Alternatively, you can run the conversion in a Docker container:

```bash
make build-docker-convert-gguf

# Mount the volume where an original model is stored (without symlinks).
docker run \
  -it \
  --entrypoint /bin/bash \
  -v /Users/kenji/base-models:/base-models \
  llm-operator/experiments-convert_gguf:latest

python convert.py /base-models --outfile google-gemma-2b-q8_0 --outtype q8_0
```
Here is another example that also quantizes the converted model:

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make llama-quantize

ORIG_MODEL_PATH=./hf-model-dir
python convert_hf_to_gguf.py ${ORIG_MODEL_PATH} --outtype f16 --outfile converted.bin

# See https://github.com/ggerganov/llama.cpp/discussions/406 to understand options like q4_0.
./llama-quantize converted.bin quantized.bin q4_0

MODEL_NAME=<target model name (e.g., meta-llama/Meta-Llama-3.1-8B-Instruct-q4)>
aws s3 cp quantized.bin s3://llm-operator-models/v1/base-models/"${MODEL_NAME}"/model.gguf
```
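To confirm the upload, list the destination path (assuming the same bucket and prefix as above):

```bash
aws s3 ls s3://llm-operator-models/v1/base-models/"${MODEL_NAME}"/
```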