FastAPI server for serving language and speech models (using HF transformers) with batched inference and streaming support.
Note: This project is a learning experiment and is not intended for production use.
Models:

- google/gemma-3-270m-it
- openai/whisper-large-v3-turbo
 
- Text Generation: Google Gemma 3 270M with token streaming via Server-Sent Events (sketched below)
- Speech-to-Text: OpenAI Whisper Large v3 Turbo (non-streaming only)
- Batched Inference: Efficient processing of multiple requests (a rough sketch follows the usage examples)
- Independent Completion: Requests finish as soon as they're done (no batch stragglers)
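
A minimal sketch of the streaming path, assuming HF transformers' `TextIteratorStreamer` and a trimmed-down request schema (illustrative only, not the project's exact code; the non-streaming branch is omitted):

```python
from threading import Thread

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

app = FastAPI()
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m-it")

class GenerateRequest(BaseModel):
    text: str
    max_output_tokens: int = 100
    stream: bool = True

@app.post("/generate")
def generate(req: GenerateRequest):
    inputs = tokenizer(req.text, return_tensors="pt")
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    # Run generation in a background thread; the streamer yields decoded chunks as they appear.
    Thread(
        target=model.generate,
        kwargs={**inputs, "max_new_tokens": req.max_output_tokens, "streamer": streamer},
    ).start()

    def sse_events():
        for chunk in streamer:
            yield f"data: {chunk}\n\n"  # Server-Sent Events framing

    return StreamingResponse(sse_events(), media_type="text/event-stream")
```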
 
Install uv, then:

```bash
uv sync
source .venv/bin/activate
echo "HF_TOKEN=your_token" > .env
make dev
```

Example requests:

```bash
# Non-streaming
curl -X POST "http://localhost:8000/generate" \
  -H "Content-Type: application/json" \
  -d '{"text": "What is AI?", "max_output_tokens": 100}'
# Streaming
curl -X POST "http://localhost:8000/generate" \
  -H "Content-Type: application/json" \
  -d '{"text": "Write a story", "stream": true}'
```

Make targets:

- `make dev` - Start FastAPI dev server
- `make format` - Format code using ruff
- `make type-check` - Type check using pyright
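
The batched-inference feature listed above boils down to a request queue plus a background worker that drains it in small batches. Below is a minimal sketch of that idea, assuming a single shared queue and a blocking `generate_batch` helper (illustrative only; it omits streaming and the independent-completion behaviour, where the real server lets finished requests return early):

```python
import asyncio

from fastapi import FastAPI
from transformers import AutoModelForCausalLM, AutoTokenizer

app = FastAPI()
request_queue: asyncio.Queue = asyncio.Queue()

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m-it")
tokenizer.padding_side = "left"  # decoder-only models are left-padded for batched generate
model = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m-it")

def generate_batch(prompts: list[str]) -> list[str]:
    # Hypothetical helper: tokenize the whole batch with padding and run one generate call.
    inputs = tokenizer(prompts, return_tensors="pt", padding=True)
    output_ids = model.generate(**inputs, max_new_tokens=100)
    new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]  # drop the (padded) prompts
    return tokenizer.batch_decode(new_tokens, skip_special_tokens=True)

async def batch_worker(max_batch_size: int = 8, window_s: float = 0.01) -> None:
    loop = asyncio.get_running_loop()
    while True:
        batch = [await request_queue.get()]  # wait until at least one request arrives
        deadline = loop.time() + window_s
        # Gather more requests for a short window, up to the batch size limit.
        while len(batch) < max_batch_size and (remaining := deadline - loop.time()) > 0:
            try:
                batch.append(await asyncio.wait_for(request_queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        # Run the blocking model call off the event loop so new requests keep queueing.
        outputs = await asyncio.to_thread(generate_batch, [prompt for prompt, _ in batch])
        for (_, future), output in zip(batch, outputs):
            future.set_result(output)

@app.on_event("startup")
async def start_worker() -> None:
    app.state.worker = asyncio.create_task(batch_worker())

@app.post("/generate")
async def generate(payload: dict) -> dict[str, str]:
    future: asyncio.Future[str] = asyncio.get_running_loop().create_future()
    await request_queue.put((payload["text"], future))
    return {"output": await future}
```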
Progress:

- ✅ Basic inference
- ✅ Token streaming
- ✅ Request batching
- ✅ Concurrent inference
- ✅ Whisper for STT
- ✅ Temperature adjustment
- ❌ Continuous batching
- ✅ KV caching optimization - not really from scratch, as I just used `past_key_values` from transformers :( (sketched below)
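
For context on that last item, reusing `past_key_values` from transformers amounts to a decode loop like the one below: the prompt is run once (prefill), and each subsequent step feeds only the newest token alongside the cached keys/values. A sketch with greedy decoding, not the project's exact code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m-it")
model.eval()

prompt_ids = tokenizer("What is AI?", return_tensors="pt").input_ids
generated = prompt_ids

with torch.no_grad():
    # Prefill: run the whole prompt once and keep the attention KV cache.
    out = model(prompt_ids, use_cache=True)
    past_key_values = out.past_key_values

    for _ in range(50):
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        generated = torch.cat([generated, next_token], dim=-1)
        if next_token.item() == tokenizer.eos_token_id:
            break
        # Decode step: feed only the new token; the cache covers everything before it.
        out = model(next_token, use_cache=True, past_key_values=past_key_values)
        past_key_values = out.past_key_values

print(tokenizer.decode(generated[0], skip_special_tokens=True))
```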