This project provides a Docker-based framework for developing and evaluating Large Language Model (LLM) agents for Capture The Flag (CTF) competitions. It supports both file-based challenges and complex network-based challenges with multiple containerized services.
- Docker-based execution: Complete isolation and reproducible environments
- Multi-service challenges: Deploy web applications, databases, and other services
- Network-aware agents: Automatic detection of network vs file-based challenges
- LLM cost tracking: Full observability of API usage and costs
- Automated evaluation: Batch testing across multiple challenges
- Docker: Required for containerized challenge execution
- uv: Package manager for Python dependencies
Install uv if you don't have it: https://docs.astral.sh/uv/getting-started/installation/
Install dependencies:

```bash
uv sync
```

The agent requires access to an LLM through a LiteLLM-compatible API.
- Copy the example environment file: `cp .env.example .env`
- Edit the `.env` file with your LiteLLM endpoint and API key (see the sketch below):

  ```
  LITELLM_BASE_URL="https://your-litellm-proxy-url.com"
  LITELLM_API_KEY="your-litellm-api-key"
  ```
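The framework's actual LLM integration lives in `helper/llm_helper.py`; purely as an illustration of how these variables are consumed, a minimal standalone call through the `litellm` package (with a hypothetical model name, and assuming `python-dotenv` is installed) might look like:

```python
# Minimal illustration of calling a LiteLLM-compatible endpoint.
# This is NOT the project's llm_helper.py; model name is a placeholder.
import os

from dotenv import load_dotenv  # assumes python-dotenv is available
import litellm

load_dotenv()  # reads LITELLM_BASE_URL and LITELLM_API_KEY from .env

response = litellm.completion(
    model="openai/gpt-4o-mini",  # hypothetical; use whatever model your proxy serves
    api_base=os.environ["LITELLM_BASE_URL"],
    api_key=os.environ["LITELLM_API_KEY"],
    messages=[{"role": "user", "content": "Say hello"}],
)
print(response.choices[0].message.content)
```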
All challenges now run in Docker containers with automatic resource management.
To evaluate against all challenges:

```bash
uv run eval_agent.py
```

To run a single challenge:

```bash
uv run eval_agent.py --challenge <challenge_name>
```

Examples:

```bash
uv run eval_agent.py --challenge baby_cat              # File-based challenge
uv run eval_agent.py --challenge easy_sql_injection    # Network-based challenge
```

Results are saved in `eval_results/` with detailed logs, costs, and LLM request tracking.
```
.
├── agent/
│   └── agent.py           # Main agent with network/file challenge detection
├── challenges/
│   ├── baby_cat/          # File-based challenge example
│   │   ├── artifacts/
│   │   │   └── myfile.txt
│   │   └── challenge.json
│   ├── easy_sql_injection/ # Network-based challenge example
│   │   ├── docker/         # Service container definitions
│   │   │   ├── Dockerfile
│   │   │   └── ...
│   │   ├── artifacts/
│   │   └── challenge.json
│   └── ...
├── docker/
│   └── agent/             # Agent container configuration
│       ├── Dockerfile     # Agent execution environment
│       └── run_agent.py   # Container entry point
├── eval_results/          # Timestamped evaluation results
├── helper/
│   ├── agent_boilerplate.py # Agent interface definition
│   ├── ctf_challenge.py   # Challenge models with service support
│   ├── docker_manager.py  # Docker orchestration and networking
│   └── llm_helper.py      # LLM integration with cost tracking
├── .env                   # Environment configuration (API keys)
├── eval_agent.py          # Main evaluation orchestrator
└── README.md              # This file
```
- Open `agent/agent.py`.
- The file contains a `SimpleAgent` class that implements the `AgentInterface`.
- Modify the `solve_challenge` method to implement your own strategy. The agent automatically detects:
  - File-based challenges: accessed via `challenge.working_folder`, which contains the artifacts
  - Network-based challenges: accessed via `challenge.network_info`, which provides service discovery
- Use the `CTFChallengeClient` object for challenge interaction and flag submission (a minimal sketch follows this list).
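The real interface is defined in `helper/agent_boilerplate.py` and `helper/ctf_challenge.py`. As a hedged starting point, a minimal `solve_challenge` might look like the sketch below, where the method signature, the `submit_flag` helper, and the shape of `network_info` are assumptions rather than the project's confirmed API:

```python
# Illustrative agent only: method and attribute names follow the description
# above and are assumptions, not the project's actual interface.
import re
from pathlib import Path


class MySimpleAgent:
    FLAG_RE = re.compile(r"flag\{\S+\}")

    def solve_challenge(self, challenge):
        if getattr(challenge, "network_info", None):
            # Network-based challenge: enumerate the deployed services first.
            # (Assumes network_info behaves like a dict with a "services" key.)
            for service in challenge.network_info.get("services", []):
                print(f"Discovered service: {service}")
            # ... drive the LLM to probe the services here ...
            return None

        # File-based challenge: grep the provided artifacts for a flag.
        for path in Path(challenge.working_folder).rglob("*"):
            if path.is_file():
                match = self.FLAG_RE.search(path.read_text(errors="ignore"))
                if match:
                    # submit_flag is assumed to exist on the challenge client.
                    return challenge.submit_flag(match.group(0))
        return None
```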
- Create a new directory in `challenges/` (e.g., `my_challenge/`).
- Create `challenge.json` (a quick consistency check is sketched after this list):

  ```json
  {
    "name": "My File Challenge",
    "description": "Find the hidden flag in the provided files.",
    "categories": ["misc", "forensics"],
    "flag": "flag{this_is_the_secret}",
    "flag_regex": "flag\\{\\S+\\}"
  }
  ```

- Create an `artifacts/` subdirectory with the challenge files.
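The framework's own challenge validation lives in `helper/ctf_challenge.py`; the snippet below is just a quick local sanity check (assuming `flag_regex` is meant to match the whole flag, which is an assumption) that the flag and regex in a new `challenge.json` agree:

```python
# Sanity-check a new challenge.json: the flag should match its own flag_regex.
import json
import re

with open("challenges/my_challenge/challenge.json") as f:  # path from the example above
    spec = json.load(f)

assert re.fullmatch(spec["flag_regex"], spec["flag"]), "flag does not match flag_regex"
print("challenge.json looks consistent")
```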
- Create the challenge directory and `challenge.json`:

  ```json
  {
    "name": "My Web Challenge",
    "description": "Exploit the vulnerable web application.",
    "categories": ["web", "sql"],
    "flag": "flag{sql_injection_success}",
    "flag_regex": "flag\\{\\S+\\}",
    "services": [
      {
        "name": "webapp",
        "image": "my-webapp:latest",
        "ports": {"80/tcp": 8080},
        "environment": {"FLAG": "flag{sql_injection_success}"}
      }
    ]
  }
  ```

- Create a `docker/` subdirectory with the service Dockerfile and application code.
- The agent will automatically discover services via Docker networking (see the sketch below for one way an agent might probe them).
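How service addresses are exposed to the agent is determined by `helper/docker_manager.py`; the sketch below assumes `challenge.network_info` maps service names to host/port entries (an assumption, not the confirmed schema) and shows how an agent could probe the example `webapp` service:

```python
# Illustrative probe of a deployed service; the shape of network_info is an
# assumption -- check helper/ctf_challenge.py for the real model.
import requests


def probe_webapp(challenge):
    service = challenge.network_info["services"]["webapp"]  # hypothetical lookup
    base_url = f"http://{service['host']}:{service['port']}"

    # Fetch the landing page to confirm the service is reachable.
    resp = requests.get(f"{base_url}/", timeout=5)
    print(resp.status_code, resp.text[:200])

    # A classic SQL-injection probe against a hypothetical login endpoint.
    resp = requests.post(
        f"{base_url}/login",
        data={"username": "' OR '1'='1' -- ", "password": "x"},
        timeout=5,
    )
    print(resp.status_code, resp.text[:200])
```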
Each evaluation provides detailed observability:
- Per-challenge costs: Individual LLM usage tracking
- Request IDs: Full audit trail of API calls
- Usage analytics: Saved in `eval_results/*/llm_usage.json` (a cost-summary sketch follows this list)
- Batch summaries: Total costs across multiple challenges
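The precise schema of `llm_usage.json` is defined by `helper/llm_helper.py`; assuming each file holds a list of request records with a per-request `cost` field (an assumption), batch spend could be summarized like this:

```python
# Summarize LLM spend across a batch of eval runs.  The per-record "cost"
# field is an assumption about llm_usage.json's schema.
import json
from pathlib import Path

total = 0.0
for usage_file in Path("eval_results").glob("*/llm_usage.json"):
    records = json.loads(usage_file.read_text())
    run_cost = sum(r.get("cost", 0.0) for r in records)
    total += run_cost
    print(f"{usage_file.parent.name}: ${run_cost:.4f}")
print(f"Total: ${total:.4f}")
```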
The system uses Docker containers for challenge execution with the following flow (sketched in code after the list):
- Challenge Detection: Automatic identification of file vs network challenges
- Service Deployment: Docker containers for challenge services (if any)
- Network Creation: Isolated Docker network per challenge
- Agent Execution: Containerized agent with access to services
- Result Collection: LLM usage data and results extracted from containers
- Resource Cleanup: Automatic cleanup of containers and networks
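The actual orchestration is implemented in `helper/docker_manager.py` and `eval_agent.py`; purely as an illustration of the flow above, a stripped-down version using the `docker` Python SDK (with placeholder image and network names) might look like:

```python
# Rough sketch of the per-challenge lifecycle using the docker SDK.
# Image and network names are placeholders, not the project's actual values.
import docker

client = docker.from_env()

# Network creation: one isolated bridge network per challenge.
network = client.networks.create("ctf_my_challenge", driver="bridge")

# Service deployment: start the challenge service(s) on that network.
webapp = client.containers.run(
    "my-webapp:latest", detach=True, network=network.name, name="webapp",
    environment={"FLAG": "flag{sql_injection_success}"},
)

# Agent execution: run the containerized agent with access to the services.
agent = client.containers.run("ctf-agent:latest", detach=True, network=network.name)

try:
    result = agent.wait()              # block until the agent finishes
    print(agent.logs().decode())       # result collection (logs, usage data)
    print("exit code:", result["StatusCode"])
finally:
    # Resource cleanup: remove containers and the per-challenge network.
    for container in (agent, webapp):
        container.remove(force=True)
    network.remove()
```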
For detailed architecture documentation, see `docs/architecture.md`.