zenbase-ai/browser-agent

Browser Agent

An AI-powered browser automation microservice built on the Kernel platform that uses browser-use for intelligent web browsing tasks.

Overview

The browser-agent microservice provides AI-powered browser automation capabilities, allowing you to control browsers using natural language instructions. It supports multiple LLM providers (Anthropic Claude, OpenAI GPT, Google Gemini, Azure OpenAI, Groq, and Ollama) and can handle complex multi-step web tasks including data extraction, form filling, file downloads, and CAPTCHA solving.

Features

  • AI-powered browser automation: Uses LLMs to intelligently control browsers and perform complex web tasks
  • Multi-step task execution: Decomposes complex requests into sub-tasks and executes them sequentially
  • Multi-provider LLM support: Works with Anthropic Claude, OpenAI GPT, Google Gemini, Azure OpenAI, Groq, and Ollama
  • File handling: Automatically downloads PDFs and other files, uploads them to cloud storage
  • CAPTCHA solving: Built-in capability to handle CAPTCHAs and similar challenges
  • Session management: Creates isolated browser sessions with proper cleanup
  • Trajectory tracking: Records and stores complete execution history for analysis
  • AI Gateway integration: Compatible with any AI gateway (Cloudflare, Azure, etc.) or direct provider APIs

Getting Started

Prerequisites

  • mise - Development environment manager
  • Python 3.11+ (managed via mise)
  • Node.js with bun (for deployment tools, managed via mise)

# Install development tools
mise install

# Install Python dependencies
uv sync

# Copy environment template
cp .env.example .env

Edit your .env file with the required values:

# LLM Provider Configuration
# Option 1: Direct API access (no gateway) - providers use default endpoints
# Nothing required here - providers will use their default API endpoints!

# Option 2: With AI Gateway (Cloudflare example)
AI_GATEWAY_URL="https://gateway.ai.cloudflare.com/v1/{account_id}/ai-gateway"
AI_GATEWAY_HEADERS='{"cf-aig-authorization": "Bearer your-gateway-token"}'
ANTHROPIC_CONFIG='{"base_url": "${AI_GATEWAY_URL}/anthropic", "default_headers": ${AI_GATEWAY_HEADERS}}'
OPENAI_CONFIG='{"base_url": "${AI_GATEWAY_URL}/openai", "default_headers": ${AI_GATEWAY_HEADERS}}'
GEMINI_CONFIG='{"http_options": {"base_url": "${AI_GATEWAY_URL}/google-ai-studio", "headers": ${AI_GATEWAY_HEADERS}}}'

# Option 3: Provider-specific configurations
# Azure OpenAI
AZURE_OPENAI_CONFIG='{"azure_endpoint": "https://your-resource.openai.azure.com/", "api_version": "2024-02-01"}'

# Groq
GROQ_CONFIG='{"base_url": "https://api.groq.com/openai/v1"}'

# Ollama (local)
OLLAMA_CONFIG='{"base_url": "http://localhost:11434/v1"}'

# Kernel Platform (required)
KERNEL_API_KEY="sk_xxxxx"

# S3-compatible storage for file downloads (required)
S3_BUCKET="browser-agent"
S3_ACCESS_KEY_ID="your-access-key"
S3_ENDPOINT_URL="https://{account_id}.r2.cloudflarestorage.com"
S3_SECRET_ACCESS_KEY="your-secret-key"

# Optional Configuration
# Browser viewport size (default: 1440x900)
# VIEWPORT_SIZE='{"width": 1440, "height": 900}'

# Set to "debug" for verbose browser-use logging
# BROWSER_USE_LOGGING_LEVEL="info"

# Set to "false" to disable anonymous telemetry
# ANONYMIZED_TELEMETRY="false"

Test that everything is working:

# Start the development server
just dev

# In another terminal, check the service is running
curl http://localhost:8000/health

API Reference

Endpoint

POST /apps/browser-agent/actions/perform

Request Format

{
  "input": "Task description for the browser agent",
  "provider": "anthropic|gemini|openai|azure_openai|groq|ollama",
  "model": "claude-3-5-sonnet-20241022|gpt-4o|gemini-2.0-flash-exp|llama-3.3-70b-versatile",
  "api_key": "your-llm-api-key",
  "instructions": "Optional additional instructions",
  "stealth": true,
  "headless": false,
  "browser_timeout": 60,
  "max_steps": 100,
  "reasoning": true,
  "flash": false
}

Request Parameters

  • input (required): Natural language description of the task to perform
  • provider (required): LLM provider ("anthropic", "gemini", "openai", "azure_openai", "groq", or "ollama")
  • model (required): Specific model to use (e.g., "claude-3-sonnet-20240229")
  • api_key (required): API key for the LLM provider
  • instructions (optional): Additional context or constraints for the task
  • stealth (optional): Enable stealth mode to avoid detection (default: true)
  • headless (optional): Run browser in headless mode (default: false)
  • browser_timeout (optional): Browser session shutdown timeout in seconds (default: 60)
  • max_steps (optional): Maximum number of automation steps (default: 100)
  • reasoning (optional): Enable step-by-step reasoning (default: true)
  • flash (optional): Use faster execution mode (default: false)
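To see how the parameters compose, here is a small client-side sketch that merges the documented defaults with caller overrides. The `build_request` helper is illustrative, not part of the service:

```python
import json

# Documented defaults for the optional parameters above.
DEFAULTS = {
    "stealth": True,
    "headless": False,
    "browser_timeout": 60,
    "max_steps": 100,
    "reasoning": True,
    "flash": False,
}

def build_request(task: str, provider: str, model: str, api_key: str, **options) -> dict:
    """Build a perform-action payload, rejecting unknown option names."""
    unknown = set(options) - set(DEFAULTS) - {"instructions"}
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    payload = {"input": task, "provider": provider, "model": model, "api_key": api_key}
    payload.update({**DEFAULTS, **options})
    return payload

payload = build_request(
    "Go to example.com and extract the main article",
    provider="anthropic",
    model="claude-3-5-sonnet-20241022",
    api_key="sk-ant-xxxxx",
    headless=True,
    max_steps=50,
)
print(json.dumps(payload, indent=2))
```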

Response Format

{
  "session": "browser-session-id",
  "success": true,
  "duration": 45.2,
  "result": "Task completion summary",
  "downloads": {
    "filename.pdf": "https://presigned-url",
    "data.csv": "https://presigned-url"
  }
}

Response Fields

  • session: Unique browser session identifier
  • success: Whether the task completed successfully
  • duration: Execution time in seconds
  • result: Summary of what was accomplished
  • downloads: Dictionary of downloaded files with presigned URLs
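A short sketch of consuming this shape; the values below are placeholders mirroring the example response:

```python
# Placeholder response matching the documented shape.
response = {
    "session": "browser-session-id",
    "success": True,
    "duration": 45.2,
    "result": "Task completion summary",
    "downloads": {
        "filename.pdf": "https://presigned-url",
        "data.csv": "https://presigned-url",
    },
}

if not response["success"]:
    raise RuntimeError(f"task failed after {response['duration']:.1f}s")

print(response["result"])
for filename, url in response["downloads"].items():
    # Presigned URLs expire, so fetch the files promptly.
    print(f"{filename}: {url}")
```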

Examples

Basic Web Scraping

{
  "input": "Go to example.com and extract all the text content from the main article",
  "provider": "anthropic",
  "model": "claude-4-sonnet",
  "api_key": "sk-ant-xxxxx",
  "headless": true,
  "max_steps": 50
}

Complex Task with File Download

{
  "input": "Search for Python tutorials on Google and download the first PDF result",
  "instructions": "Make sure to verify the PDF is relevant before downloading",
  "provider": "openai",
  "model": "gpt-4.1",
  "api_key": "sk-xxxxx",
  "headless": false,
  "reasoning": true
}

Form Filling

{
  "input": "Fill out the contact form on example.com with name 'John Doe', email '[email protected]', and message 'Hello world'",
  "provider": "gemini",
  "model": "gemini-2.0-flash-exp",
  "api_key": "your-gemini-key",
  "stealth": true
}

Using Azure OpenAI

{
  "input": "Navigate to news.ycombinator.com and summarize the top 5 stories",
  "provider": "azure_openai",
  "model": "gpt-4o",
  "api_key": "your-azure-openai-key",
  "headless": true
}

Using Groq

{
  "input": "Search for 'climate change' on Wikipedia and extract the first paragraph",
  "provider": "groq",
  "model": "llama-3.3-70b-versatile",
  "api_key": "your-groq-key",
  "reasoning": true
}

Using Ollama (Local)

{
  "input": "Go to example.com and take a screenshot of the homepage",
  "provider": "ollama",
  "model": "llama3.2",
  "api_key": "not-required-for-ollama",
  "headless": false
}
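Any HTTP client can submit these payloads. A stdlib-only Python sketch, assuming the local dev server started by `just dev` (substitute your deployed Kernel endpoint for production):

```python
import json
import urllib.request

def build_perform_request(payload: dict, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build a POST request for the perform action."""
    return urllib.request.Request(
        f"{base_url}/apps/browser-agent/actions/perform",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def perform(payload: dict, base_url: str = "http://localhost:8000") -> dict:
    """Submit a task and return the parsed JSON response."""
    with urllib.request.urlopen(build_perform_request(payload, base_url)) as resp:
        return json.load(resp)

# Usage (requires a running server):
# result = perform({
#     "input": "Go to example.com and extract the main article",
#     "provider": "anthropic",
#     "model": "claude-3-5-sonnet-20241022",
#     "api_key": "sk-ant-xxxxx",
#     "headless": True,
# })
# print(result["result"])
```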

Available Commands

This project uses just as a task runner. All commands are defined in the justfile.

Development Commands

just dev          # Run local development server on port 8000
just fmt          # Format and lint code with ruff (auto-fix issues)
just lint         # Check code formatting and linting (no auto-fix)

Deployment Commands

just deploy       # Deploy the service (src/app.py) to the Kernel platform
just logs         # View browser-agent logs with follow mode

AI Tool Integration

just claude       # Run Claude Code CLI (setup and development assistant)
just gemini       # Run Google Gemini CLI

Kernel Platform Commands

just kernel <cmd> # Run any Kernel CLI command (e.g., 'just kernel status')

Deployment

The deployment process:

  1. Runs formatting and linting checks
  2. Deploys src/app.py to the Kernel platform
  3. Service becomes available at the configured Kernel endpoint

Architecture

Core Components

  • src/app.py: Main Kernel app with browser-agent action. Creates browsers via kernel, instantiates Agent with custom session, runs tasks and returns trajectory results.
  • src/lib/browser/session.py: CustomBrowserSession that extends browser-use's BrowserSession, fixing viewport handling for CDP connections and setting a fixed 1024x768 resolution.
  • src/lib/browser/models.py: BrowserAgentRequest model handling LLM provider abstraction (anthropic, gemini, openai, azure_openai, groq, ollama) with AI gateway integration.
  • src/lib/gateway.py: AI gateway configuration from environment variables.
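The provider abstraction can be pictured as a dispatch from provider name to an optional `*_CONFIG` env var. This is a hypothetical sketch; the real BrowserAgentRequest in src/lib/browser/models.py builds actual LLM clients and may be structured differently:

```python
import json
import os

# Hypothetical mapping; mirrors the env vars from the configuration section.
PROVIDER_ENV = {
    "anthropic": "ANTHROPIC_CONFIG",
    "openai": "OPENAI_CONFIG",
    "gemini": "GEMINI_CONFIG",
    "azure_openai": "AZURE_OPENAI_CONFIG",
    "groq": "GROQ_CONFIG",
    "ollama": "OLLAMA_CONFIG",
}

def provider_config(provider: str) -> dict:
    """Extra client kwargs for a provider: the parsed *_CONFIG env var
    if set, otherwise {} (direct API access with default endpoints)."""
    env_name = PROVIDER_ENV.get(provider)
    if env_name is None:
        raise ValueError(f"unsupported provider: {provider}")
    raw = os.environ.get(env_name)
    return json.loads(raw) if raw else {}
```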

Key Dependencies

  • browser-use>=0.7.2 - Web automation library providing Agent and BrowserSession
  • kernel>=0.11.0 - Platform for running the browser agent service
  • zenbase-llml>=0.4.0 - LLM templating used in task construction
  • pydantic>=2.10.6 - Data validation and serialization
  • boto3>=1.40.25 - AWS S3/R2 integration for file storage

Architecture Flow

  1. Request received via Kernel platform
  2. LLM client created based on provider/model (direct API or through AI Gateway)
  3. Remote browser session established with custom configuration
  4. browser-use Agent instantiated with reasoning capabilities
  5. Task executed with intelligent planning and step-by-step execution
  6. Files automatically uploaded to Cloudflare R2 storage
  7. Trajectory and results returned with download links

Troubleshooting

Common Issues

  • Environment variables: Ensure all required environment variables are set
  • Browser timeout: Increase browser_timeout for complex tasks
  • File downloads: Check R2 bucket permissions and configuration
  • LLM provider errors: Verify API keys and model availability
  • Deployment issues: Ensure that the main entrypoint is in the root of the directory

Contributing

  1. Format code: just fmt
  2. Test changes locally: just dev
  3. Deploy to staging: just deploy
