Distributed LLM inference across multiple AWS instances with layer-wise sharding.
Before running the system, you MUST update the IP addresses in config.yaml:
```yaml
# Network configuration
network:
  # Set these to your actual instance IPs
  instance1_ip: "YOUR_INSTANCE_1_PRIVATE_IP"
  instance2_ip: "YOUR_INSTANCE_2_PRIVATE_IP"
```
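Once edited, the block should look something like this (the addresses below are only examples; substitute your own private IPs):

```yaml
network:
  instance1_ip: "10.0.1.24"   # example only: instance 1 private IP
  instance2_ip: "10.0.2.87"   # example only: instance 2 private IP
```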
To find your instance IPs in the AWS Console:
- Go to the AWS EC2 Console
- Click "Instances" in the left sidebar
- Select your instance
- Copy the Private IPv4 address from the details panel
- Repeat for your second instance
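If you prefer the command line, the same addresses can be pulled with the AWS CLI (the instance IDs below are placeholders for your own):

```bash
# Look up each instance's private IPv4 address by instance ID
aws ec2 describe-instances \
  --instance-ids i-0aaaaaaaaaaaaaaa1 i-0bbbbbbbbbbbbbbb2 \
  --query 'Reservations[].Instances[].[InstanceId,PrivateIpAddress]' \
  --output table
```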
Clone the repo and set up the environment on each instance:

```bash
git clone <your-repo-url>
cd llm_p2p
./setup_env.sh
```

Add an inbound rule for both instances:
- Type: Custom TCP
- Port: 8000
- Source: 0.0.0.0/0
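A source of 0.0.0.0/0 opens the port to the entire internet and is only convenient for testing; a tighter rule scoped to your VPC can be added with the AWS CLI instead (the security group ID and CIDR below are placeholders):

```bash
# Allow TCP 8000 only from inside the VPC
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 8000 \
  --cidr 10.0.0.0/16
```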
Instance 1 (Input Shard - Layers 0-2):

```bash
./setup_shard1.sh
```

Instance 2 (Output Shard - Layers 3-5):

```bash
./setup_shard2.sh
```
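Startup can take a while; a small wait loop confirms a shard is ready before you run the tests below. A minimal sketch, assuming the /health endpoint returns HTTP 200 once the shard is up:

```bash
# Poll the local shard until its health endpoint answers with success
until curl -sf http://localhost:8000/health > /dev/null; do
  echo "Waiting for shard to come up..."
  sleep 5
done
echo "Shard is healthy."
```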
Test from ANY instance (no master node!):

```bash
# Test input shard directly
curl -X POST http://localhost:8000/generate \
-H "Content-Type: application/json" \
-d '{"prompt": "Hello P2P", "max_length": 15}'
# Check peer discovery
curl http://localhost:8000/peers
# Check health
curl http://localhost:8000/health
```
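Because any server can handle requests, the same generate call works against the other instance too; a quick cross-instance check (the host below is a placeholder for your instance 2 private IP):

```bash
# Send the generate request to the peer instance instead of localhost
curl -X POST http://YOUR_INSTANCE_2_PRIVATE_IP:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello P2P", "max_length": 15}'
```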
For a complete walkthrough, see the Jupyter notebook:

```bash
jupyter notebook examples/example.ipynb
```
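If you are running the notebook on a headless EC2 instance, one common approach (not part of this repo's tooling, just a suggestion) is to tunnel the Jupyter port over SSH from your local machine:

```bash
# Forward local port 8888 to the instance, then browse to http://localhost:8888
ssh -i your-key.pem -L 8888:localhost:8888 ubuntu@YOUR_INSTANCE_PUBLIC_IP
```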
The notebook demonstrates:
- Multi-instance setup and health checks
- Cross-instance peer discovery
- Distributed text generation
- P2P routing
Architecture:
- Shard 0: Layers 0-2 (Input + Embeddings)
- Shard 1: Layers 3-5 (Output + LM Head)
- No Master Node: True P2P - any server can handle requests
- Auto-Discovery: Shards find each other automatically
- Direct Communication: Shard-to-shard HTTP calls
Requirements:
- 2 AWS instances with Tesla T4 GPUs
- Ubuntu 20.04+
- Port 8000 open between instances
- 8GB+ RAM per instance
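A quick preflight on each instance can confirm these requirements before setup (a sketch; nc may need to be installed, and the peer address is a placeholder):

```bash
# GPU visible?
nvidia-smi --query-gpu=name --format=csv,noheader
# Total RAM in GB (want 8+)
free -g | awk '/^Mem:/ {print $2 " GB"}'
# Is port 8000 reachable on the peer instance?
nc -zv YOUR_PEER_PRIVATE_IP 8000
```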