quantized-containerized-models is a collection of experiments and best practices for deploying optimized AI models in efficient, containerized environments. The goal is to showcase how quantization, containerization, and continuous integration/deployment (CI/CD) can work together to deliver fast, lightweight, and production-ready model deployments.
- Quantization – Reduce model size and accelerate inference using techniques like `nf4`, `int8`, and sparsity.
- Containerization – Package models with Cog, ensuring reproducible builds and smooth deployments (a minimal predictor sketch follows this list).
- CI/CD Integration – Automated pipelines for linting, testing, building, and deploying directly to Replicate.
- Deployment Tracking – Status Page for visibility into workflow health and deployment status. (TODO)
- Open Source – Fully licensed under Apache 2.0.
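To make the Cog packaging concrete, here is a minimal predictor sketch in the shape every deployment in this repo follows. The model loading and output are stand-ins, not code from any specific deployment here; only the `cog` API usage is real.

```python
from cog import BasePredictor, Input, Path


class Predictor(BasePredictor):
    def setup(self) -> None:
        # Runs once at container start: load weights here so individual
        # prediction requests don't pay the model-loading cost.
        # (Placeholder: a real deployment would load e.g. a diffusers
        # pipeline from baked-in weights.)
        self.model = None

    def predict(
        self,
        prompt: str = Input(description="Text prompt for the model"),
        steps: int = Input(description="Inference steps", default=28, ge=1, le=50),
    ) -> Path:
        # Runs per request; Cog generates an HTTP API around this method
        # and validates the typed inputs automatically.
        out = Path("/tmp/output.txt")
        out.write_text(f"{prompt} ({steps} steps)")  # stand-in for real inference
        return out
```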
- flux-fast-lora-hotswap: Built on the LoRA fast blog post, this deployment uses `flux.1-dev` models with two LoRAs that can be hot-swapped to reduce generation time and avoid graph breaks.
  - Optimized with `nf4` quantization and `torch.compile` for speedups (see the sketch below).
  - Includes an Img2Img variant.
  - Featured in the official Hugging Face blog post.
  - Source code.
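A sketch of that recipe, assuming diffusers' bitsandbytes `nf4` support and its LoRA hotswap API (`enable_lora_hotswap`, `load_lora_weights(..., hotswap=True)`); the LoRA repo names and target rank are placeholders, not the deployment's actual settings.

```python
import torch
from diffusers import BitsAndBytesConfig, FluxPipeline, FluxTransformer2DModel

# Load the Flux transformer with 4-bit nf4 weights via bitsandbytes.
quant = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant,
    torch_dtype=torch.bfloat16,
)
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()

# Prepare hotswapping before compiling, then compile the transformer once.
pipe.enable_lora_hotswap(target_rank=64)  # placeholder rank
pipe.load_lora_weights("user/lora-style-a", adapter_name="default")  # placeholder
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune")

# Swapping the second LoRA in place reuses the compiled graph instead of
# triggering a recompile, which is what keeps generation time flat.
pipe.load_lora_weights("user/lora-style-b", adapter_name="default", hotswap=True)
image = pipe("a lighthouse at dawn", num_inference_steps=28).images[0]
```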
- smollm3-3b-smashed: Uses Pruna to quantize and `torch.compile` the smollm3-3b model, enabling lower VRAM usage and faster generation (see the sketch below).
  - Supports 16k-token context windows and hybrid reasoning.
  - Source code.
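A minimal sketch of the Pruna flow assumed here, using its `SmashConfig`/`smash` API; the quantizer backend shown is illustrative and may differ from what the deployment actually uses.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from pruna import SmashConfig, smash

model_id = "HuggingFaceTB/SmolLM3-3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Configure Pruna: pair a quantizer with torch.compile as the compiler.
config = SmashConfig()
config["quantizer"] = "hqq"           # illustrative backend choice
config["compiler"] = "torch_compile"
smashed = smash(model=model, smash_config=config)

inputs = tokenizer("The capital of France is", return_tensors="pt")
print(tokenizer.decode(smashed.generate(**inputs, max_new_tokens=16)[0]))
```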
- phi-4-reasoning-plus-unsloth: Accelerates Microsoft's Phi-4 reasoning model with Unsloth, achieving faster inference and a smaller memory footprint (see the sketch below).
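A short sketch of the Unsloth loading path this entry describes: 4-bit loading plus Unsloth's optimized inference mode. The checkpoint name is an assumption, not confirmed from the deployment.

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Phi-4-reasoning-plus",  # assumed checkpoint name
    max_seq_length=4096,
    load_in_4bit=True,   # 4-bit weights: roughly 4x smaller than fp16
)
FastLanguageModel.for_inference(model)  # enable Unsloth's fast inference path

inputs = tokenizer("Is 97 prime? Think step by step.", return_tensors="pt").to("cuda")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=128)[0]))
```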
- gemma3-torchao-quant-sparse: Improves inference performance for Gemma-3-4B-IT using torchao int8 quantization combined with sparsity techniques such as granular and magnitude pruning (see the sketch below).
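A sketch of that combination on a stand-in module (the actual deployment applies it to Gemma-3-4B-IT): magnitude pruning shown via `torch.nn.utils.prune`, then torchao int8 weight-only quantization. The pruning amount is illustrative, not the repo's tuned setting.

```python
import torch
import torch.nn.utils.prune as prune
from torchao.quantization import int8_weight_only, quantize_

# Stand-in model; the deployment targets Gemma-3-4B-IT's linear layers.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 512)
)

# Magnitude pruning: zero out the smallest 30% of weights per linear layer.
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # make the zeros permanent

# torchao int8 weight-only quantization, applied in place to linear layers.
quantize_(model, int8_weight_only())

print(model(torch.randn(1, 512)).shape)
```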
This repository implements structured CI/CD pipelines that ensure quality, reliability, and smooth deployments:
- Code Quality – `flake8`, `black`, `isort`, `ty`, and `bandit` checks.
- Unit Testing – Covers core functions (`predict.py`), input/output validation, and error handling. (TODO)
- Integration Testing – Build Cog containers, validate `cog.yaml`, run health checks, and test performance. (TODO)
- Automatic deployments to Replicate on project completion.
- Staging-first workflow – Test in staging before production release.
- Semantic versioning for model releases and consistent Docker image tagging.
- Post-deployment validation using the Replicate API: response latency, output quality, and smoke tests (see the sketch after this list).
- Status Page (GitHub Pages) – Automatically updated after each deployment with latest test results, deployment times, and model health.
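A hedged sketch of such a post-deployment smoke test using the `replicate` Python client; the model slug, input, and latency threshold are placeholders.

```python
import time

import replicate

MODEL = "owner/flux-fast-lora-hotswap"  # placeholder model slug
MAX_LATENCY_S = 120                     # placeholder latency budget


def test_smoke() -> None:
    # Run one cheap prediction against the live deployment and assert
    # on latency and on the output being non-empty.
    start = time.monotonic()
    output = replicate.run(MODEL, input={"prompt": "a red apple"})
    latency = time.monotonic() - start
    assert latency < MAX_LATENCY_S, f"too slow: {latency:.1f}s"
    assert output, "model returned empty output"


if __name__ == "__main__":
    test_smoke()
```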
This project is licensed under the Apache License 2.0.