quantized-containerized-models is a collection of experiments and best practices for deploying optimized AI models in efficient, containerized environments. The goal is to showcase how quantization, containerization, and continuous integration/deployment (CI/CD) can work together to deliver fast, lightweight, and production-ready model deployments.
- Quantization – Reduce model size and accelerate inference using techniques like `nf4`, `int8`, and sparsity.
- Containerization – Package models with Cog, ensuring reproducible builds and smooth deployments (a minimal predictor sketch follows this list).
- CI/CD Integration – Automated pipelines for linting, testing, building, and deploying directly to Replicate.
- Deployment Tracking – Status Page for visibility into workflow health and deployment status. (TODO)
- Open Source – Fully licensed under Apache 2.0.
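To make the Cog packaging concrete, here is a minimal predictor sketch in the shape every deployment in this repo follows. The model loading and output are stand-ins, not code from any specific deployment here; only the `cog` API usage is real.

```python
from cog import BasePredictor, Input, Path


class Predictor(BasePredictor):
    def setup(self) -> None:
        # Runs once at container start: load weights here so individual
        # prediction requests don't pay the model-loading cost.
        # (Placeholder: a real deployment would load e.g. a diffusers
        # pipeline from baked-in weights.)
        self.model = None

    def predict(
        self,
        prompt: str = Input(description="Text prompt for the model"),
        steps: int = Input(description="Inference steps", default=28, ge=1, le=50),
    ) -> Path:
        # Runs per request; Cog generates an HTTP API around this method
        # and validates the typed inputs automatically.
        out = Path("/tmp/output.txt")
        out.write_text(f"{prompt} ({steps} steps)")  # stand-in for real inference
        return out
```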
- flux-fast-lora-hotswap: Built on the LoRA fast blog post, this deployment uses `flux.1-dev` models with two LoRAs that can be hot-swapped to reduce generation time and avoid graph breaks.
  - Optimized with `nf4` quantization and `torch.compile` for speedups (see the sketch below).
  - Includes an Img2Img variant.
  - Featured in the official Hugging Face blog post.
  - Source code.
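A sketch of that recipe, assuming diffusers' bitsandbytes `nf4` support and its LoRA hotswap API (`enable_lora_hotswap`, `load_lora_weights(..., hotswap=True)`); the LoRA repo names and target rank are placeholders, not the deployment's actual settings.

```python
import torch
from diffusers import BitsAndBytesConfig, FluxPipeline, FluxTransformer2DModel

# Load the Flux transformer with 4-bit nf4 weights via bitsandbytes.
quant = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant,
    torch_dtype=torch.bfloat16,
)
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()

# Prepare hotswapping before compiling, then compile the transformer once.
pipe.enable_lora_hotswap(target_rank=64)  # placeholder rank
pipe.load_lora_weights("user/lora-style-a", adapter_name="default")  # placeholder
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune")

# Swapping the second LoRA in place reuses the compiled graph instead of
# triggering a recompile, which is what keeps generation time flat.
pipe.load_lora_weights("user/lora-style-b", adapter_name="default", hotswap=True)
image = pipe("a lighthouse at dawn", num_inference_steps=28).images[0]
```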
- smollm3-3b-smashed: Uses Pruna to quantize and `torch.compile` the smollm3-3b model, enabling lower VRAM usage and faster generation (see the sketch below).
  - Supports 16k-token context windows and hybrid reasoning.
  - Source code.
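A minimal sketch of the Pruna flow assumed here, using its `SmashConfig`/`smash` API; the quantizer backend shown is illustrative and may differ from what the deployment actually uses.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from pruna import SmashConfig, smash

model_id = "HuggingFaceTB/SmolLM3-3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Configure Pruna: pair a quantizer with torch.compile as the compiler.
config = SmashConfig()
config["quantizer"] = "hqq"           # illustrative backend choice
config["compiler"] = "torch_compile"
smashed = smash(model=model, smash_config=config)

inputs = tokenizer("The capital of France is", return_tensors="pt")
print(tokenizer.decode(smashed.generate(**inputs, max_new_tokens=16)[0]))
```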
- phi-4-reasoning-plus-unsloth: Accelerates Microsoft's Phi-4 reasoning model with Unsloth, achieving faster inference and a smaller memory footprint (see the sketch below).
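A short sketch of the Unsloth loading path this entry describes: 4-bit loading plus Unsloth's optimized inference mode. The checkpoint name is an assumption, not confirmed from the deployment.

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Phi-4-reasoning-plus",  # assumed checkpoint name
    max_seq_length=4096,
    load_in_4bit=True,   # 4-bit weights: roughly 4x smaller than fp16
)
FastLanguageModel.for_inference(model)  # enable Unsloth's fast inference path

inputs = tokenizer("Is 97 prime? Think step by step.", return_tensors="pt").to("cuda")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=128)[0]))
```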
- gemma3-torchao-quant-sparse: Improves inference performance for Gemma-3-4B-IT using torchao int8 quantization combined with sparsity techniques such as granular and magnitude pruning (see the sketch below).
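A sketch of that combination on a stand-in module (the actual deployment applies it to Gemma-3-4B-IT): magnitude pruning shown via `torch.nn.utils.prune`, then torchao int8 weight-only quantization. The pruning amount is illustrative, not the repo's tuned setting.

```python
import torch
import torch.nn.utils.prune as prune
from torchao.quantization import int8_weight_only, quantize_

# Stand-in model; the deployment targets Gemma-3-4B-IT's linear layers.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 512)
)

# Magnitude pruning: zero out the smallest 30% of weights per linear layer.
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # make the zeros permanent

# torchao int8 weight-only quantization, applied in place to linear layers.
quantize_(model, int8_weight_only())

print(model(torch.randn(1, 512)).shape)
```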
This repository implements structured CI/CD pipelines that ensure quality, reliability, and smooth deployments:
- Code Quality – `flake8`, `black`, `isort`, `ty`, and `bandit` checks.
- Unit Testing – Covers core functions (`predict.py`), input/output validation, and error handling. (TODO)
- Integration Testing – Build Cog containers, validate `cog.yaml`, run health checks, and test performance. (TODO)
- Automatic deployments to Replicate on project completion.
- Staging-first workflow – Test in staging before production release.
- Semantic versioning for model releases and consistent Docker image tagging.
- Post-deployment validation using the Replicate API: response latency, output quality, and smoke tests (see the sketch after this list).
- Status Page (GitHub Pages) – Automatically updated after each deployment with latest test results, deployment times, and model health.
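A hedged sketch of such a post-deployment smoke test using the `replicate` Python client; the model slug, input, and latency threshold are placeholders.

```python
import time

import replicate

MODEL = "owner/flux-fast-lora-hotswap"  # placeholder model slug
MAX_LATENCY_S = 120                     # placeholder latency budget


def test_smoke() -> None:
    # Run one cheap prediction against the live deployment and assert
    # on latency and on the output being non-empty.
    start = time.monotonic()
    output = replicate.run(MODEL, input={"prompt": "a red apple"})
    latency = time.monotonic() - start
    assert latency < MAX_LATENCY_S, f"too slow: {latency:.1f}s"
    assert output, "model returned empty output"


if __name__ == "__main__":
    test_smoke()
```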
This project is licensed under the Apache License 2.0.