# OpticXT: Vision-Driven Autonomous Robot Control System
OpticXT is a high-performance, real-time robot control system that combines computer vision, audio processing, and multimodal AI, running GPU-accelerated, ISQ-quantized inference on NVIDIA hardware for edge deployment.
OpticXT transforms visual and audio input into contextual understanding and immediate robotic actions through GPU-accelerated inference.
- ISQ Quantization: In-Situ Quantization with Q4K precision for optimal speed/quality balance
- CUDA Acceleration: Full GPU acceleration on NVIDIA RTX 4090 and compatible hardware
- Fast Model Loading: 22-second model loading with optimized memory footprint
- Multimodal Support: Text, image, and audio processing with the `unsloth/gemma-3n-E4B-it` vision model
- Real-time Inference: 36-38% GPU utilization with 6.8GB VRAM usage for continuous processing
- OpenAI-Compatible APIs: Support for GPT-4o, Claude, Groq, and custom endpoints
- Minimal Hardware: Run on Raspberry Pi Zero 2 W with remote inference
- Vision Support: Full multimodal capabilities via remote models
- Seamless Integration: Same interface for local and remote models
- Provider Flexibility: Switch between OpenAI, Anthropic, Groq, or self-hosted models
- Real camera input with automatic device detection and hardware fallback
- Optimized object detection with spam prevention (max 10 high-confidence objects)
- Concise scene descriptions: "Environment contains: person, 9 rectangular objects"
- Real-time audio input from microphone with voice activity detection
- Text-to-speech output with configurable voice options
- OpenAI Tool Calling: Modern function call interface for precise robot actions
- Action-First Architecture: Direct translation of visual context to robot commands
- Context-Aware Responses: Real model computation influences tool calls based on multimodal input
- Safety Constraints: Built-in collision avoidance and human detection
- Hardware Integration: Real motor control and sensor feedback loops
- CUDA Detection: Automatic GPU detection with CPU fallback
- Memory Efficient: ISQ reduces memory footprint compared to full-precision models
- Edge-Ready: Optimized for NVIDIA Jetson Nano and desktop GPU deployment
- Real-time Pipeline: Sub-second inference with continuous processing
OpticXT follows a clean, organized structure that separates core functionality from supporting files:
```text
src/                 - Main source code
  main.rs            - Application entry point and CLI
  models.rs          - AI model management (local/remote)
  config.rs          - Configuration management
  pipeline.rs        - Vision-action processing pipeline
  camera.rs          - Camera input handling
  audio.rs           - Audio input/output processing
  vision_basic.rs    - Computer vision processing
  remote_model.rs    - OpenAI-compatible API client
tests/               - Comprehensive test suite
examples/            - Usage examples and configurations
scripts/             - Utility and testing scripts
docs/                - Comprehensive documentation
config.toml          - Main configuration file
models/              - Model storage directory
prompts/             - System prompts and templates
```
This organization provides clear separation between operational code, testing, examples, and documentation, making the project easy to navigate and maintain.
1. **Visual Input**: Real-time camera stream with automatic device detection
2. **Object Detection**: Optimized computer vision with spam prevention (max 10 objects)
3. **AI Processing**: GPU-accelerated ISQ inference with multimodal understanding
4. **Action Output**: OpenAI-style function calls for immediate robot execution
### ISQ Quantization System
OpticXT uses In-Situ Quantization (ISQ) for optimal performance:
- **Q4K Precision**: 4-bit quantization with optimal speed/quality balance
- **In-Memory Processing**: Weights quantized during model loading (reduced memory footprint)
- **GPU Acceleration**: Full CUDA support with 36-38% GPU utilization on RTX 4090
- **Fast Loading**: 22-second model initialization vs. slower UQFF alternatives
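As a rough sketch of what this looks like in code (not a copy of OpticXT's `models.rs`), the mistral.rs builder API referenced later in this README can load the vision model and apply Q4K in-situ quantization roughly as follows; the exact method names depend on the mistralrs crate version and should be treated as assumptions:

```rust
// Hypothetical sketch: loading the vision model with in-situ Q4K quantization
// via mistral.rs. Builder method names follow recent mistralrs releases and
// may differ between crate versions; treat this as an outline only.
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionMessages, VisionModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    // Download (or reuse the cached copy of) the model, quantizing weights
    // to Q4K while they are loaded rather than shipping a pre-quantized file.
    let model = VisionModelBuilder::new("unsloth/gemma-3n-E4B-it")
        .with_isq(IsqType::Q4K)
        .with_logging()
        .build()
        .await?;

    // Single text-only request, just to confirm the pipeline is alive.
    let messages = VisionMessages::new()
        .add_message(TextMessageRole::User, "Describe the current scene.");
    let response = model.send_chat_request(messages).await?;
    println!("{:?}", response.choices[0].message.content);
    Ok(())
}
```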
### Command Execution Framework
Building on modern OpenAI tool calling, OpticXT translates model outputs into robot actions:
- Physical movements and navigation commands
- Sensor integration and feedback loops
- Environmental awareness and safety constraints
- Audio/visual output generation
## Architecture Philosophy
OpticXT represents a paradigm shift from conversational AI to action-oriented intelligence. By eliminating the conversational layer and focusing purely on vision-driven decision-making, we achieve the low-latency response times critical for real-world robotics applications.
The system acts as a remote AGiXT agent, maintaining compatibility with the broader ecosystem while operating independently on edge hardware. This hybrid approach enables sophisticated behaviors through local processing while retaining the ability to offload complex tasks when needed.
## Use Cases
- Autonomous navigation in dynamic environments
- Real-time object interaction and manipulation
- Surveillance and monitoring applications
- Assistive robotics with visual understanding
- Industrial automation with adaptive behavior
## Getting Started
### Prerequisites
#### System Requirements
- Any Linux system with camera and microphone (for video chat mode)
- NVIDIA Jetson Nano (16GB) or Go2 robot (for full robot mode)
- USB camera, webcam, or CSI camera module
- Microphone and speakers/headphones for audio
- Rust 1.70+ installed
#### Dependencies
```bash
# Ubuntu/Debian - Basic dependencies
sudo apt update
sudo apt install -y build-essential cmake pkg-config
# Audio system dependencies (required)
sudo apt install -y libasound2-dev portaudio19-dev
# TTS support (required for voice output)
sudo apt install -y espeak espeak-data libespeak-dev
# Optional: Additional audio codecs
sudo apt install -y ffmpeg libavcodec-dev libavformat-dev libavutil-dev
```
- Clone the repository

  ```bash
  git clone https://github.com/Josh-XT/OpticXT.git
  cd OpticXT
  ```
- Setup CUDA Environment (required for GPU acceleration)

  For PowerShell (Windows/WSL):

  ```powershell
  # Set CUDA environment variables
  $env:CUDA_ROOT = "/usr/local/cuda-12.5"
  $env:PATH = "$env:CUDA_ROOT/bin:$env:PATH"
  $env:LD_LIBRARY_PATH = "$env:CUDA_ROOT/lib64:$env:LD_LIBRARY_PATH"
  # Or run the setup script
  ./setup_cuda.ps1
  ```

  For Bash/Zsh (Linux/macOS):

  ```bash
  # Set CUDA environment variables
  export CUDA_ROOT="/usr/local/cuda-12.5"
  export PATH="$CUDA_ROOT/bin:$PATH"
  export LD_LIBRARY_PATH="$CUDA_ROOT/lib64:$LD_LIBRARY_PATH"
  # Or run the setup script
  source ./setup_cuda.sh
  ```
- Build with CUDA support for GPU acceleration

  ```bash
  # For NVIDIA GPU acceleration (recommended) - with CUDA environment
  cargo build --release

  # For CPU-only mode (fallback) - no CUDA dependencies
  cargo build --release --no-default-features
  ```
- The system uses ISQ quantization with automatic model download

  The system automatically downloads and quantizes the `unsloth/gemma-3n-E4B-it` model:

  - Model: `unsloth/gemma-3n-E4B-it` (vision-capable, no authentication required)
  - Quantization: ISQ Q4K (in-situ quantization during loading)
  - Loading Time: ~22 seconds with GPU acceleration
  - Memory Usage: ~6.8GB VRAM on RTX 4090

  Note: Models are downloaded automatically from HuggingFace on first run. No manual model installation is required.
Edit `config.toml` to match your setup:

```toml
[vision]
width = 640                   # Camera resolution
height = 480
confidence_threshold = 0.5

[audio]
input_device = "default"      # Microphone device
output_device = "default"     # Speaker device
voice = "en"                  # TTS voice language
enable_vad = true             # Voice activity detection

[model]
model_path = "models/gemma-3n-E4B-it-Q4_K_M.gguf"
temperature = 0.7             # Lower = more deterministic

# Remote model configuration (optional)
# Uncomment to use a remote API instead of the local model
# [model.remote]
# base_url = "https://api.openai.com/v1"
# api_key = "your-api-key-here"
# model_name = "gpt-4o"
# supports_vision = true

[performance]
use_gpu = true                # Set to false if no CUDA
processing_interval_ms = 100  # Adjust for performance
```
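For reference, the TOML above maps naturally onto serde-derived structs. The following is a hypothetical sketch (OpticXT's actual `config.rs` types may differ) covering only the fields shown, using the `serde` and `toml` crates:

```rust
// Hypothetical config structs covering the fields shown above.
// OpticXT's real config.rs may use different names; this only illustrates
// how config.toml can be deserialized with serde + the toml crate.
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct Config {
    vision: VisionConfig,
    audio: AudioConfig,
    model: ModelConfig,
    performance: PerformanceConfig,
}

#[derive(Debug, Deserialize)]
struct VisionConfig {
    width: u32,
    height: u32,
    confidence_threshold: f32,
}

#[derive(Debug, Deserialize)]
struct AudioConfig {
    input_device: String,
    output_device: String,
    voice: String,
    enable_vad: bool,
}

#[derive(Debug, Deserialize)]
struct ModelConfig {
    model_path: String,
    temperature: f32,
    // Optional [model.remote] table; absent when running fully locally.
    remote: Option<RemoteModelConfig>,
}

#[derive(Debug, Deserialize)]
struct RemoteModelConfig {
    base_url: String,
    api_key: String,
    model_name: String,
    supports_vision: bool,
}

#[derive(Debug, Deserialize)]
struct PerformanceConfig {
    use_gpu: bool,
    processing_interval_ms: u64,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let raw = std::fs::read_to_string("config.toml")?;
    let config: Config = toml::from_str(&raw)?;
    println!("{config:#?}");
    Ok(())
}
```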
OpticXT supports remote model inference via OpenAI-compatible APIs, enabling deployment on low-power hardware like Raspberry Pi Zero 2 W. When a remote model is configured, OpticXT will use it instead of local inference, dramatically reducing hardware requirements while maintaining full functionality.
Supported Remote Providers:
- OpenAI: GPT-4o with vision support for high-quality multimodal inference
- Groq: Ultra-fast inference with Llama models (text-only)
- Anthropic: Claude models via OpenAI-compatible endpoints
- Local APIs: LM Studio, Ollama, or any OpenAI-compatible server
Example Remote Configurations:

```toml
# OpenAI GPT-4o with vision
[model.remote]
base_url = "https://api.openai.com/v1"
api_key = "your-openai-key"
model_name = "gpt-4o"
supports_vision = true
```

```toml
# Groq (very fast, text-only)
[model.remote]
base_url = "https://api.groq.com/openai/v1"
api_key = "your-groq-key"
model_name = "llama-3.1-70b-versatile"
supports_vision = false
```

```toml
# Local LM Studio server
[model.remote]
base_url = "http://localhost:1234/v1"
api_key = "not-needed"
model_name = "local-model"
supports_vision = false
```
Benefits of Remote Models:
- Minimal Hardware: Run on Pi Zero 2 W or any low-power device
- No GPU Required: Offload inference to powerful remote servers
- Latest Models: Access to cutting-edge models without local storage
- Scalability: Handle multiple robot instances without per-device model loading
OpticXT includes comprehensive examples and utility scripts:
Configuration Examples (`examples/`):

- `examples/remote_model_examples.toml` - Remote model configurations for various providers
- `examples/example_api_client.py` - Python client demonstrating API usage

Utility Scripts (`scripts/`):

- `scripts/test_api.sh` - Test API endpoints
- `scripts/debug_cuda.sh` - CUDA environment debugging
- `scripts/demo.sh` - System demonstration
- `scripts/test_model_performance.sh` - Performance benchmarking
OpticXT runs as an autonomous robot control system with GPU-accelerated AI inference:
```bash
# Run with CUDA acceleration (recommended)
cargo run --release --features cuda -- --verbose

# Monitor GPU utilization (separate terminal)
watch -n 1 nvidia-smi

# Check real-time inference performance
cargo run --release --features cuda -- --verbose 2>&1 | grep "GPU Utilization\|GPU Memory"
```
```text
USAGE:
    opticxt [OPTIONS]

OPTIONS:
    -c, --config <CONFIG>             Configuration file path [default: config.toml]
    -d, --camera-device <DEVICE>      Camera device index [default: 0]
    -m, --model-path <PATH>           Override model path from config
        --video-chat                  Run in video chat/assistant mode
        --chat-mode <MODE>            Chat mode: assistant, monitoring [default: assistant]
    -v, --verbose                     Enable verbose logging with GPU monitoring
        --benchmark                   Run model performance benchmark
        --benchmark-iterations <N>    Number of benchmark iterations [default: 50]
        --api-server                  Start API server mode
        --api-port <PORT>             API server port [default: 8080]
    -h, --help                        Print help information
```
Note: All test commands have been moved to standard Cargo test commands. Use `cargo test` instead of command-line test flags.
- Model Loading: ~22 seconds (ISQ quantization)
- GPU Memory Usage: ~6.8GB VRAM / 24.5GB total (28% utilization)
- Inference Speed: 36-38% GPU utilization during processing
- Object Detection: Max 10 high-confidence objects per frame
- Response Time: 3-6 seconds per inference cycle
After installation, run the system:

```bash
# Start in video chat mode (default)
cargo run --release --features cuda

# Run with custom config
cargo run --release --features cuda -- --config custom.toml --camera-device 1
```
OpticXT can also run as a REST API server for integration with web applications, mobile apps, or other services:
```bash
# Start API server on default port 8080
cargo run --release --features cuda -- --api-server

# Start on custom port
cargo run --release --features cuda -- --api-server --api-port 3000

# CPU-only mode (no CUDA required)
cargo run --release --no-default-features -- --api-server
```
API Endpoint: POST /v1/inference
- Accepts multipart form data with optional text, images, or video files
- Returns JSON responses with model-generated text
- Supports real-time task status monitoring
- See `API_DOCUMENTATION.md` for detailed usage examples
Test the API:

```bash
# Simple text inference
curl -X POST http://localhost:8080/v1/inference -F "text=What do you see?"

# Image analysis
curl -X POST http://localhost:8080/v1/inference -F "text=Describe this image" -F "[email protected]"
```
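For programmatic integration, the same requests can be made from code. The sketch below mirrors the curl examples above using the `reqwest` crate (with its `blocking` and `multipart` features); it is an illustration of the request shape, not part of OpticXT, and the `text`/`image` field names are taken directly from the curl commands:

```rust
// Illustrative client for the /v1/inference endpoint described above.
// Assumes reqwest with the "blocking" and "multipart" features enabled;
// field names mirror the curl examples (text, image).
use reqwest::blocking::{multipart, Client};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new();

    // Text-only inference, equivalent to: curl -F "text=What do you see?"
    let form = multipart::Form::new().text("text", "What do you see?");
    let body = client
        .post("http://localhost:8080/v1/inference")
        .multipart(form)
        .send()?
        .text()?;
    println!("text-only response: {body}");

    // Image analysis, equivalent to: curl -F "text=..." -F "[email protected]"
    let form = multipart::Form::new()
        .text("text", "Describe this image")
        .file("image", "test_image.jpg")?; // reads the image from disk
    let body = client
        .post("http://localhost:8080/v1/inference")
        .multipart(form)
        .send()?
        .text()?;
    println!("image response: {body}");
    Ok(())
}
```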
For robot control mode, edit `config.toml` to enable hardware integration.
Quick Test: Verify your setup with tests:

```bash
# Run basic functionality tests
cargo test --test test_simple

# Test with real camera (if available)
cargo test --test test_camera_vision
```
For video chat mode:

- Any Linux system with USB ports
- Webcam or built-in camera
- Microphone and speakers/headphones
- 4GB RAM minimum (8GB recommended)
- Rust 1.70+ toolchain

For full robot mode:

- NVIDIA Jetson Nano (16GB) or Go2 robot platform
- CSI or USB camera
- Microphone and speaker system
- Motor controllers and actuators
- Optional: LiDAR sensors (falls back to simulation)
OpticXT includes a complete real audio pipeline:
- Input: Real-time microphone capture with voice activity detection
- Output: Text-to-speech synthesis with multiple voice options
- Processing: Audio filtering and noise reduction
- Fallback: Graceful degradation when audio hardware is unavailable
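As an illustration of how the capture side of such a pipeline can be wired with cpal (the crate OpticXT uses for audio input), here is a minimal sketch assuming cpal 0.15; the RMS threshold is a stand-in for real voice activity detection and none of this is taken from the project's `audio.rs`:

```rust
// Minimal microphone capture sketch using cpal (assuming cpal 0.15).
// A plain RMS threshold stands in for real voice activity detection;
// this is an illustration, not OpticXT's audio.rs.
use cpal::traits::{DeviceTrait, HostTrait, StreamTrait};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let host = cpal::default_host();
    let device = host
        .default_input_device()
        .ok_or("no input device available")?;
    let config = device.default_input_config()?;

    let stream = device.build_input_stream(
        &config.into(),
        move |samples: &[f32], _| {
            // Crude "voice activity" check: flag frames whose RMS energy
            // rises above an arbitrary threshold.
            let rms = (samples.iter().map(|s| s * s).sum::<f32>()
                / samples.len().max(1) as f32)
                .sqrt();
            if rms > 0.02 {
                println!("voice activity detected (rms = {rms:.3})");
            }
        },
        |err| eprintln!("audio input error: {err}"),
        None, // no capture timeout
    )?;

    stream.play()?;
    std::thread::sleep(std::time::Duration::from_secs(10));
    Ok(())
}
```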
Flexible camera support with automatic detection:
- USB Cameras: Automatic detection of any UVC-compatible camera
- Built-in Cameras: Laptop/desktop integrated cameras
- CSI Cameras: Jetson Nano camera modules
- Fallback: Simulation mode when no camera is detected
OpticXT includes comprehensive testing capabilities organized in the `tests/` directory:
```bash
# Run all tests using standard Cargo commands
cargo test

# Run unit tests only
cargo test --lib

# Run integration tests
cargo test --test integration_tests

# Run specific test files
cargo test --test test_simple
cargo test --test test_multimodal
cargo test --test test_camera_vision
```
Tests are organized in the following structure:
- `tests/integration_tests.rs` - Main integration tests
- `tests/test_simple.rs` - Basic text inference tests
- `tests/test_multimodal.rs` - Multimodal (text + image + audio) tests
- `tests/test_image.rs` - Image processing tests
- `tests/test_camera_vision.rs` - Real camera vision tests
- `tests/test_remote_model.rs` - Remote model configuration tests
- `tests/test_vision_flow.rs` - Vision pipeline flow tests
- `tests/test_vision_pipeline.rs` - Vision processing pipeline tests
- Start with basic tests: `cargo test --lib` for unit tests
- Test specific components: `cargo test --test test_simple` for text inference
- Test camera integration: `cargo test --test test_camera_vision` (requires camera)
- Test multimodal features: `cargo test --test test_multimodal`
- Test remote models: `cargo test --test test_remote_model`
- Full integration: `cargo test --test integration_tests`
The `scripts/` directory contains utility scripts for testing and development:

- `scripts/test_api.sh` - API endpoint testing
- `scripts/test_model_performance.sh` - Model performance benchmarks
- `scripts/debug_cuda.sh` - CUDA debugging utilities
- `scripts/demo.sh` - System demonstration script
The `examples/` directory contains practical examples and configurations:

- `examples/example_api_client.py` - Python API client example
- `examples/remote_model_examples.toml` - Remote model configuration examples

For detailed remote model configuration examples, see `examples/remote_model_examples.toml`.
- Model Loading: 15-25 seconds depending on hardware
- Text Generation: 50-200ms per response
- Image Processing: 200-800ms per image
- Audio Processing: 100-500ms per audio segment
- Tool Call Generation: Valid JSON format with proper function structure
- CUDA Not Detected:
  - Verify drivers: `nvidia-smi`
  - Rebuild: `cargo clean && cargo build --release --features cuda`
  - Ensure NVIDIA drivers are installed and up to date
  - Verify CUDA toolkit installation (see `CUDA_BUILD_GUIDE.md`)
  - Check that your GPU supports the CUDA version
  - Try running without CUDA features first: `cargo run --release`
- Model Download Issues:
  - Verify internet connection for model downloads
  - Check available disk space (models can be several GB)
  - Ensure sufficient system memory (8GB+ recommended)
  - Clear the model cache if corruption is suspected: `rm -rf ~/.cache/huggingface`
- No Camera Detected:

  ```bash
  # Check available cameras
  ls /dev/video*
  v4l2-ctl --list-devices
  # Fix permissions
  sudo usermod -a -G video $USER
  # Log out and back in
  ```
- Audio Issues:

  ```bash
  # Check audio devices
  aplay -l    # List playback devices
  arecord -l  # List capture devices
  # Test microphone
  arecord -d 5 test.wav && aplay test.wav
  # Fix audio permissions
  sudo usermod -a -G audio $USER
  ```
- For multiple cameras, try different device indices (0, 1, 2...)
OpticXT successfully loads and runs AI models with real neural network inference:
```bash
# 1. Current Status Check
ls -lh models/
# Should show both gemma-3n-E4B-it-Q4_K_M.gguf and tokenizer.json

# 2. What's Working:
# ✅ Model file loading (GGUF format)
# ✅ Tokenizer loading and text processing
# ✅ Model architecture initialization
# ✅ Neural network forward pass with real inference
# ✅ Real token generation from model logits
# ✅ Context-aware function call output generation from actual model output
# ✅ Complete removal of all hardcoded/simulation fallbacks

# 3. Current Behavior:
# - Models load successfully with real tokenizer and UQFF quantization
# - Real model inference with mistral.rs and multimodal support (text/image/audio)
# - OpenAI-style function calls generated from genuine model output
# - System fails gracefully with clear error messages when models unavailable
# - All functionality (camera, audio, movement) works with real hardware input
# - NO hardcoded responses or simulation fallbacks whatsoever

# 4. Expected Log Messages:
# ✅ "✅ Successfully loaded HuggingFace Gemma 3n model with multimodal support"
# ✅ "Real multimodal model generated text in XXXms"
# ✅ "Running model inference with X modalities"
# ❌ "Model inference timed out after 180 seconds" (indicates model performance issues)

# 5. Error Handling:
# The system properly fails with informative errors when:
# - Model files are missing or corrupted
# - Tokenizer cannot be loaded
# - Real inference fails
# This ensures complete authenticity - no fake responses under any circumstances
```
Current Status: The system uses exclusively real neural network inference with genuine GGUF model loading and authentic tokenizer processing. All simulation logic, hardcoded responses, and fallback mechanisms have been completely removed. The system will only operate with actual model inference or fail gracefully with clear error messages.
```bash
# Install missing dependencies
sudo apt install -y build-essential cmake pkg-config
sudo apt install -y libasound2-dev portaudio19-dev
sudo apt install -y espeak espeak-data libespeak-dev

# Clean and rebuild
cargo clean
cargo build --release
```

```bash
# Monitor system resources
htop  # Watch for CPU/memory usage

# Adjust processing interval in config.toml
# Increase processing_interval_ms for lower resource usage
```
Enable comprehensive logging:

```bash
RUST_LOG=debug cargo run -- --verbose
```
This shows:
- Camera detection and initialization
- Audio device enumeration
- Model loading status (real or simulation)
- Frame processing times
- Audio input/output status
- Error details and fallback triggers
OpticXT is built around real hardware components with intelligent fallbacks:
- Primary: Real camera capture via nokhwa library
- Fallback: Simulated visual input when no camera detected
- Support: USB, CSI, and built-in cameras with automatic detection
- Input: Real microphone capture via cpal library
- Output: Text-to-speech via tts/espeak integration
- Processing: Voice activity detection and audio filtering
- Fallback: Silent operation when audio hardware unavailable
- Model Loading: Successfully loads UQFF models with mistral.rs VisionModelBuilder
- Real Inference: Full multimodal neural network inference with text, image, and audio support
- Context-Aware Responses: Real model computation influences function call output based on multimodal input
- Tool Call Generation: OpenAI-style function calls generated from actual model outputs
- Intelligent Fallback: Graceful degradation when models unavailable
Video chat mode:

```text
Camera → Vision Processing → AI Model → TTS Response
   ↑                                        ↓
Microphone ← Audio Processing ← User Interaction
```

Robot control mode:

```text
Sensors → Context Assembly → AI Decision → Function Calls
   ↑                                          ↓
Environment ← Robot Actions ← Motor Control ← Tool Call Output
```
OpticXT generates OpenAI-style function calls for precise robot control:
```json
[{
  "id": "call_1",
  "type": "function",
  "function": {
    "name": "move",
    "arguments": "{\"direction\": \"forward\", \"distance\": 1.0, \"speed\": \"slow\", \"reasoning\": \"Moving forward to investigate detected object\"}"
  }
}]
```

```json
[{
  "id": "call_1",
  "type": "function",
  "function": {
    "name": "speak",
    "arguments": "{\"text\": \"I can see someone approaching\", \"voice\": \"default\", \"reasoning\": \"Alerting about detected human presence\"}"
  }
}]
```

```json
[{
  "id": "call_1",
  "type": "function",
  "function": {
    "name": "analyze",
    "arguments": "{\"target\": \"obstacle\", \"detail_level\": \"detailed\", \"reasoning\": \"Need to assess navigation path\"}"
  }
}]
```
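To make the contract concrete, the sketch below shows how a payload in this format could be deserialized and dispatched with `serde`/`serde_json`; the struct layout mirrors the JSON above, but the type and function names are hypothetical and not taken from OpticXT's source:

```rust
// Hypothetical dispatcher for tool-call payloads in the format shown above.
// The struct layout mirrors the JSON examples; names are illustrative only.
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct ToolCall {
    id: String,
    #[serde(rename = "type")]
    kind: String,
    function: FunctionCall,
}

#[derive(Debug, Deserialize)]
struct FunctionCall {
    name: String,
    // OpenAI-style tool calls encode arguments as a JSON string,
    // so they are parsed in a second step below.
    arguments: String,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let payload = r#"[{
        "id": "call_1",
        "type": "function",
        "function": {
            "name": "move",
            "arguments": "{\"direction\": \"forward\", \"distance\": 1.0, \"speed\": \"slow\"}"
        }
    }]"#;

    let calls: Vec<ToolCall> = serde_json::from_str(payload)?;
    for call in &calls {
        // Second parse: the arguments string becomes a JSON object.
        let args: serde_json::Value = serde_json::from_str(&call.function.arguments)?;
        match call.function.name.as_str() {
            "move" => println!("move -> {args}"),
            "speak" => println!("speak -> {args}"),
            "analyze" => println!("analyze -> {args}"),
            other => eprintln!("unknown tool call {}: {other}", call.id),
        }
    }
    Ok(())
}
```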
- Integrate Enhanced Model Support: Add Llama variants and other state-of-the-art models
- Add ROS2 Integration: Advanced robotics framework support for complex robot control
- Enhance Real-time Audio: Improved speech recognition and audio processing capabilities
- Create Web Interface: Browser-based control panel for remote operation
- Add Docker Support: Containerized deployment for easier setup and scaling
- Enable Multi-Robot Coordination: Support for controlling multiple robots simultaneously
- Install dependencies (audio system required)
- Build project: `cargo build --release`
- Run: `cargo run --release`
- Mode: Automatically runs as video chat assistant
- Flash Jetson with Ubuntu 20.04
- Install Rust toolchain
- Install system dependencies (including CUDA if available)
- Build with optimizations: `cargo build --release`
- Configure hardware in `config.toml`
- Deploy: Copy binary and config to target system
- Cross-compile for ARM64 architecture
- Install on Go2 via SDK deployment tools
- Configure sensors for robot hardware
- Enable robot mode in configuration
- Test control commands before full deployment
✅ GPU-Accelerated AI: ISQ quantization with CUDA acceleration on NVIDIA RTX 4090
✅ Real Camera Input: Works with any USB/CSI/built-in camera
✅ Real Audio I/O: Microphone input and TTS output
✅ Multimodal AI: unsloth/gemma-3n-E4B-it model with text, image, and audio support
✅ OpenAI Tool Calling: Robot control commands via modern function call interface
✅ Optimized Vision: Spam prevention with max 10 high-confidence objects per frame
✅ Hardware Auto-Detection: Real hardware integration with CUDA detection and CPU fallback
✅ Edge Deployment Ready: Optimized for NVIDIA Jetson and desktop GPU platforms
✅ Production Ready: 22-second model loading, 36-38% GPU utilization, 6.8GB VRAM usage
Performance Status: Full GPU acceleration achieved with ISQ quantization. The system loads the `unsloth/gemma-3n-E4B-it` model in 22 seconds, utilizes 28% of RTX 4090 VRAM, and processes inference at 36-38% GPU utilization with optimized vision processing.
We welcome contributions! Please see our contributing guidelines for details on:
- Code style and standards
- Testing requirements
- Pull request process
- Development setup
For detailed documentation, see:
- API Documentation - Complete API reference
- CUDA Build Guide - GPU setup instructions
- Remote Model Implementation - Remote API integration guide
⭐ Found OpticXT useful? Star this repository to show your support!
❓ Have questions? Open an issue - we're here to help!
This project is licensed under the MIT License - see the LICENSE file for details.
If OpticXT has helped you, consider supporting development:
- ⭐ Star the repo to show your support!
- 💬 Join discussions or report issues on GitHub
- ☕ Support development: CA: `Ga9P2TZcxsHjYmXdEyu9Z7wL1QAowjBAZwRQ41gBbonk`
Real-time robot control with ISQ quantization and CUDA acceleration - from autonomous navigation to precise manipulation.