Description
Course
machine-learning-zoomcamp
Question
Now that we have deployed the churn model using FastAPI and Docker, what would be the best practices for scaling this deployment if we wanted to handle many simultaneous requests from users?
Answer
Scaling a FastAPI + Docker deployment to handle many simultaneous requests involves several best practices:
- Use a production-ready ASGI server – Instead of running uvicorn directly, consider using Uvicorn with Gunicorn (gunicorn -k uvicorn.workers.UvicornWorker) to manage multiple worker processes. This allows your app to handle more concurrent requests.
```bash
gunicorn -k uvicorn.workers.UvicornWorker app.main:app --workers 4 --bind 0.0.0.0:8000
```
- Container orchestration – To run multiple instances, use tools like Docker Compose, Kubernetes, or AWS ECS to manage scaling, load balancing, and failover (see the Compose sketch after this list).
- Horizontal scaling – Run multiple containers of your FastAPI app behind a load balancer so incoming requests are distributed across instances.
- Caching and async processing – Use caching (e.g., Redis) for repeated predictions or heavy computations, and take advantage of FastAPI’s async endpoints for non-blocking request handling (see the caching sketch below).
- Monitoring and logging – Implement monitoring (Prometheus, Grafana) and structured logging to detect bottlenecks or failures under high load (see the monitoring sketch below).
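As a concrete starting point for the orchestration and horizontal-scaling items, here is a minimal Docker Compose sketch. The service name, image setup, and nginx config file are assumptions, not part of the original deployment:

```yaml
# docker-compose.yml (sketch; service and file names are assumptions)
services:
  churn-api:
    build: .                 # Dockerfile for the FastAPI churn service
    deploy:
      replicas: 3            # horizontal scaling: three app containers
    expose:
      - "8000"               # only reachable inside the compose network

  nginx:
    image: nginx:latest
    ports:
      - "80:80"              # single public entry point
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro   # proxy_pass http://churn-api:8000
    depends_on:
      - churn-api
```

Compose's internal DNS resolves churn-api to all replicas, so an nginx proxy_pass to http://churn-api:8000 distributes requests across them; Kubernetes gives you the same pattern with a Deployment plus a Service, and adds autoscaling on top.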
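For the caching and async point, here is a minimal sketch, assuming a Redis container reachable at host redis; the Customer schema and the predict function are hypothetical stand-ins for your churn model code:

```python
# Sketch: cache churn predictions in Redis, keyed by the request payload.
# `Customer` and `predict` are placeholders for your actual churn code.
import hashlib
import json

import redis.asyncio as aioredis
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
cache = aioredis.Redis(host="redis", port=6379, decode_responses=True)

class Customer(BaseModel):
    tenure: int
    monthly_charges: float

def predict(customer: dict) -> float:
    return 0.5  # placeholder; call your trained churn model here

@app.post("/predict")
async def predict_endpoint(customer: Customer):
    payload = customer.model_dump()
    # Deterministic cache key derived from the request body
    key = "churn:" + hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()
    ).hexdigest()

    cached = await cache.get(key)        # non-blocking Redis lookup
    if cached is not None:
        return {"churn_probability": float(cached), "cached": True}

    prob = predict(payload)              # model inference
    await cache.set(key, prob, ex=3600)  # cache the result for 1 hour
    return {"churn_probability": prob, "cached": False}
```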
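For monitoring, one option (an assumption about your stack; any Prometheus client works) is the prometheus-fastapi-instrumentator package, which exposes request count and latency metrics with a couple of lines:

```python
# Sketch: expose request metrics at /metrics for Prometheus to scrape;
# Grafana can then chart latency and error rates under load.
from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()
Instrumentator().instrument(app).expose(app)  # adds the /metrics endpoint
```

Point your Prometheus scrape config at /metrics, and you can alert on latency percentiles or error rates before users notice a slowdown.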
✅ Summary: For scaling, combine production-grade server setup, multiple container instances, load balancing, caching, and monitoring to ensure your deployment can handle many simultaneous requests efficiently.
Checklist
- I have searched existing FAQs and this question is not already answered
- The answer provides accurate, helpful information
- I have included any relevant code examples or links