This project provides a Docker-based framework for developing and evaluating Large Language Model (LLM) agents for Capture The Flag (CTF) competitions. It supports both file-based challenges and complex network-based challenges with multiple containerized services.
- Docker-based execution: Complete isolation and reproducible environments
- Multi-service challenges: Deploy web applications, databases, and other services
- Network-aware agents: Automatic detection of network vs file-based challenges
- LLM cost tracking: Full observability of API usage and costs
- Automated evaluation: Batch testing across multiple challenges
- Docker: Required for containerized challenge execution
- uv: Package manager for Python dependencies
Install uv if you don't have it: https://docs.astral.sh/uv/getting-started/installation/
Install dependencies:
```bash
uv sync
```

The agent requires access to an LLM through a LiteLLM-compatible API.
- Copy the example environment file:

  ```bash
  cp .env.example .env
  ```

- Edit the `.env` file with your LiteLLM endpoint and API key:

  ```bash
  LITELLM_BASE_URL="https://your-litellm-proxy-url.com"
  LITELLM_API_KEY="your-litellm-api-key"
  ```
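To sanity-check the configuration, the snippet below sends a single request through the proxy. It assumes the LiteLLM proxy exposes an OpenAI-compatible endpoint (the usual LiteLLM setup) and that a model with the placeholder name used here is configured on it; it is a quick check, not part of the framework.

```python
# Minimal connectivity check against the LiteLLM proxy.
# The model name is a placeholder; use one configured on your proxy.
import os

from openai import OpenAI

client = OpenAI(
    base_url=os.environ["LITELLM_BASE_URL"],
    api_key=os.environ["LITELLM_API_KEY"],
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Reply with the word 'ready'."}],
)
print(response.choices[0].message.content)
```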
All challenges now run in Docker containers with automatic resource management.
To evaluate against all challenges:
```bash
uv run eval_agent.py
```

To run a single challenge:

```bash
uv run eval_agent.py --challenge <challenge_name>
```

Examples:

```bash
uv run eval_agent.py --challenge baby_cat            # File-based challenge
uv run eval_agent.py --challenge easy_sql_injection  # Network-based challenge
```

Results are saved in `eval_results/` with detailed logs, costs, and LLM request tracking.
```
.
├── agent/
│   └── agent.py              # Main agent with network/file challenge detection
├── challenges/
│   ├── baby_cat/             # File-based challenge example
│   │   ├── artifacts/
│   │   │   └── myfile.txt
│   │   └── challenge.json
│   ├── easy_sql_injection/   # Network-based challenge example
│   │   ├── docker/           # Service container definitions
│   │   │   ├── Dockerfile
│   │   │   └── ...
│   │   ├── artifacts/
│   │   └── challenge.json
│   └── ...
├── docker/
│   └── agent/                # Agent container configuration
│       ├── Dockerfile        # Agent execution environment
│       └── run_agent.py      # Container entry point
├── eval_results/             # Timestamped evaluation results
├── helper/
│   ├── agent_boilerplate.py  # Agent interface definition
│   ├── ctf_challenge.py      # Challenge models with service support
│   ├── docker_manager.py     # Docker orchestration and networking
│   └── llm_helper.py         # LLM integration with cost tracking
├── .env                      # Environment configuration (API keys)
├── eval_agent.py             # Main evaluation orchestrator
└── README.md                 # This file
```
- Open `agent/agent.py`.
- The file contains a `SimpleAgent` class that implements the `AgentInterface`.
- Modify the `solve_challenge` method to implement your own strategy (a minimal sketch follows this list). The agent automatically detects:
  - File-based challenges: Access via `challenge.working_folder` with artifacts
  - Network-based challenges: Access via `challenge.network_info` with service discovery
- Use the `CTFChallengeClient` object for challenge interaction and flag submission.
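The sketch below shows the general shape of a custom `solve_challenge`. The attribute and method names (`network_info` as a dict, a `submit_flag` method) are illustrative assumptions; check `helper/agent_boilerplate.py` and `helper/ctf_challenge.py` for the actual interface.

```python
# Illustrative agent strategy only; attribute/method names are assumptions,
# not the project's verified API. See helper/agent_boilerplate.py.
from pathlib import Path


class MyAgent:
    def solve_challenge(self, challenge) -> bool:
        if getattr(challenge, "network_info", None):
            # Network-based challenge: enumerate the deployed services first.
            for service in challenge.network_info.get("services", []):
                print(f"Discovered service: {service}")
            candidate = "flag{found_via_service_probing}"
        else:
            # File-based challenge: inspect the provided artifacts.
            for artifact in Path(challenge.working_folder).rglob("*"):
                print(f"Artifact: {artifact}")
            candidate = "flag{found_in_artifacts}"

        # Submit whatever flag the strategy produced (hypothetical method name).
        return challenge.submit_flag(candidate)
```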
- Create a new directory in `challenges/` (e.g., `my_challenge/`).
- Create `challenge.json`:

  ```json
  {
    "name": "My File Challenge",
    "description": "Find the hidden flag in the provided files.",
    "categories": ["misc", "forensics"],
    "flag": "flag{this_is_the_secret}",
    "flag_regex": "flag\\{\\S+\\}"
  }
  ```

- Create an `artifacts/` subdirectory with the challenge files.
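The `flag_regex` field is a standard Python regular expression. As a rough illustration (not the project's actual checking code), a candidate flag could be validated against the fields above like this:

```python
# Sketch of flag validation using the challenge.json fields above;
# the framework's real evaluator may differ.
import json
import re
from pathlib import Path

spec = json.loads(Path("challenges/my_challenge/challenge.json").read_text())
candidate = "flag{this_is_the_secret}"

looks_like_flag = re.fullmatch(spec["flag_regex"], candidate) is not None
is_correct = candidate == spec["flag"]
print(looks_like_flag, is_correct)
```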
- Create the challenge directory and `challenge.json`:

  ```json
  {
    "name": "My Web Challenge",
    "description": "Exploit the vulnerable web application.",
    "categories": ["web", "sql"],
    "flag": "flag{sql_injection_success}",
    "flag_regex": "flag\\{\\S+\\}",
    "services": [
      {
        "name": "webapp",
        "image": "my-webapp:latest",
        "ports": {"80/tcp": 8080},
        "environment": {"FLAG": "flag{sql_injection_success}"}
      }
    ]
  }
  ```

- Create a `docker/` subdirectory with the service Dockerfile and application code.
- The agent will automatically discover services via Docker networking (a probing sketch follows).
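Once the `webapp` service above is running, an agent on the challenge network can probe it over HTTP. The sketch below assumes the service is reachable by its service name and that `requests` is available in the agent container; both are assumptions, and the actual host/port should come from `challenge.network_info`.

```python
# Probe the deployed service from inside the challenge network.
# Hostname resolution by service name is an assumption; use the
# host/port reported in challenge.network_info in a real agent.
import requests

resp = requests.get("http://webapp/", timeout=5)
print(resp.status_code, resp.text[:200])

# First SQL-injection probe against a hypothetical login form.
resp = requests.post(
    "http://webapp/login",
    data={"username": "admin' OR '1'='1", "password": "x"},
    timeout=5,
)
print(resp.status_code)
```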
Each evaluation provides detailed observability:
- Per-challenge costs: Individual LLM usage tracking
- Request IDs: Full audit trail of API calls
- Usage analytics: Saved in `eval_results/*/llm_usage.json`
- Batch summaries: Total costs across multiple challenges
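The schema of `llm_usage.json` is defined by `helper/llm_helper.py`; as an illustration only (the `total_cost` field name below is an assumption), a batch cost summary could be assembled like this:

```python
# Illustrative aggregation over eval_results/*/llm_usage.json.
# Inspect a generated file to confirm the real field names.
import json
from pathlib import Path

total = 0.0
for usage_file in Path("eval_results").glob("*/llm_usage.json"):
    data = json.loads(usage_file.read_text())
    total += data.get("total_cost", 0.0)  # assumed field name
print(f"Total LLM cost across challenges: ${total:.4f}")
```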
The system uses Docker containers for challenge execution with the following flow:
- Challenge Detection: Automatic identification of file vs network challenges
- Service Deployment: Docker containers for challenge services (if any)
- Network Creation: Isolated Docker network per challenge
- Agent Execution: Containerized agent with access to services
- Result Collection: LLM usage data and results extracted from containers
- Resource Cleanup: Automatic cleanup of containers and networks
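The sketch below mirrors that flow using the Docker SDK for Python. It is a simplified illustration of what `helper/docker_manager.py` does, not its actual implementation; image and network names are placeholders.

```python
# Simplified illustration of the per-challenge Docker flow; the real
# orchestration lives in helper/docker_manager.py.
import docker

client = docker.from_env()

# Network Creation: isolated network for this challenge run.
network = client.networks.create("ctf-my-challenge", driver="bridge")

# Service Deployment: start each declared challenge service.
service = client.containers.run(
    "my-webapp:latest",
    name="webapp",
    network=network.name,
    environment={"FLAG": "flag{sql_injection_success}"},
    detach=True,
)

# Agent Execution: run the agent container on the same network.
agent = client.containers.run(
    "ctf-agent:latest",  # placeholder image name
    network=network.name,
    detach=True,
)

# Result Collection: wait for the agent and read its output.
agent.wait()
print(agent.logs().decode())

# Resource Cleanup: remove containers and the network.
for container in (agent, service):
    container.remove(force=True)
network.remove()
```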
For detailed architecture documentation, see docs/architecture.md.