FastAPI server for serving language and speech models (using HF transformers) with batched inference and streaming support.
Note: This project is a learning experiment and is not intended for production use.
Models:

- google/gemma-3-270m-it
- openai/whisper-large-v3-turbo
 
- Text Generation: Google Gemma 3 270M with token streaming via Server-Sent Events (sketched below)
- Speech-to-Text: OpenAI Whisper Large v3 Turbo (non-streaming only)
- Batched Inference: Efficient processing of multiple requests (a rough sketch follows the usage examples)
- Independent Completion: Requests finish as soon as they're done (no batch stragglers)
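
A minimal sketch of the streaming path, assuming HF transformers' `TextIteratorStreamer` and a trimmed-down request schema (illustrative only, not the project's exact code; the non-streaming branch is omitted):

```python
from threading import Thread

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

app = FastAPI()
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m-it")

class GenerateRequest(BaseModel):
    text: str
    max_output_tokens: int = 100
    stream: bool = True

@app.post("/generate")
def generate(req: GenerateRequest):
    inputs = tokenizer(req.text, return_tensors="pt")
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    # Run generation in a background thread; the streamer yields decoded chunks as they appear.
    Thread(
        target=model.generate,
        kwargs={**inputs, "max_new_tokens": req.max_output_tokens, "streamer": streamer},
    ).start()

    def sse_events():
        for chunk in streamer:
            yield f"data: {chunk}\n\n"  # Server-Sent Events framing

    return StreamingResponse(sse_events(), media_type="text/event-stream")
```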
 
Install uv, then:

```bash
uv sync
source .venv/bin/activate
echo "HF_TOKEN=your_token" > .env
make dev
```

Example requests:

```bash
# Non-streaming
curl -X POST "http://localhost:8000/generate" \
  -H "Content-Type: application/json" \
  -d '{"text": "What is AI?", "max_output_tokens": 100}'
# Streaming
curl -X POST "http://localhost:8000/generate" \
  -H "Content-Type: application/json" \
  -d '{"text": "Write a story", "stream": true}'
```

Make targets:

- `make dev` - Start FastAPI dev server
- `make format` - Format code using ruff
- `make type-check` - Type check using pyright
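
The batched-inference feature listed above boils down to a request queue plus a background worker that drains it in small batches. Below is a minimal sketch of that idea, assuming a single shared queue and a blocking `generate_batch` helper (illustrative only; it omits streaming and the independent-completion behaviour, where the real server lets finished requests return early):

```python
import asyncio

from fastapi import FastAPI
from transformers import AutoModelForCausalLM, AutoTokenizer

app = FastAPI()
request_queue: asyncio.Queue = asyncio.Queue()

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m-it")
tokenizer.padding_side = "left"  # decoder-only models are left-padded for batched generate
model = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m-it")

def generate_batch(prompts: list[str]) -> list[str]:
    # Hypothetical helper: tokenize the whole batch with padding and run one generate call.
    inputs = tokenizer(prompts, return_tensors="pt", padding=True)
    output_ids = model.generate(**inputs, max_new_tokens=100)
    new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]  # drop the (padded) prompts
    return tokenizer.batch_decode(new_tokens, skip_special_tokens=True)

async def batch_worker(max_batch_size: int = 8, window_s: float = 0.01) -> None:
    loop = asyncio.get_running_loop()
    while True:
        batch = [await request_queue.get()]  # wait until at least one request arrives
        deadline = loop.time() + window_s
        # Gather more requests for a short window, up to the batch size limit.
        while len(batch) < max_batch_size and (remaining := deadline - loop.time()) > 0:
            try:
                batch.append(await asyncio.wait_for(request_queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        # Run the blocking model call off the event loop so new requests keep queueing.
        outputs = await asyncio.to_thread(generate_batch, [prompt for prompt, _ in batch])
        for (_, future), output in zip(batch, outputs):
            future.set_result(output)

@app.on_event("startup")
async def start_worker() -> None:
    app.state.worker = asyncio.create_task(batch_worker())

@app.post("/generate")
async def generate(payload: dict) -> dict[str, str]:
    future: asyncio.Future[str] = asyncio.get_running_loop().create_future()
    await request_queue.put((payload["text"], future))
    return {"output": await future}
```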
Progress:

- ✅ Basic inference
- ✅ Token streaming
- ✅ Request batching
- ✅ Concurrent inference
- ✅ Whisper for STT
- ✅ Temperature adjustment
- ❌ Continuous batching
- ✅ KV caching optimization - not really from scratch, as I just used `past_key_values` from transformers :( (sketched below)
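
For context on that last item, reusing `past_key_values` from transformers amounts to a decode loop like the one below: the prompt is run once (prefill), and each subsequent step feeds only the newest token alongside the cached keys/values. A sketch with greedy decoding, not the project's exact code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m-it")
model.eval()

prompt_ids = tokenizer("What is AI?", return_tensors="pt").input_ids
generated = prompt_ids

with torch.no_grad():
    # Prefill: run the whole prompt once and keep the attention KV cache.
    out = model(prompt_ids, use_cache=True)
    past_key_values = out.past_key_values

    for _ in range(50):
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        generated = torch.cat([generated, next_token], dim=-1)
        if next_token.item() == tokenizer.eos_token_id:
            break
        # Decode step: feed only the new token; the cache covers everything before it.
        out = model(next_token, use_cache=True, past_key_values=past_key_values)
        past_key_values = out.past_key_values

print(tokenizer.decode(generated[0], skip_special_tokens=True))
```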