A real-time voice conversation agent powered by OpenAI's GPT-5, featuring speech-to-text, text-to-speech, and video stream analysis capabilities.
Original Source: This project is based on the GPT-5 voice agent single-file example by @kwindla.
Original Post: X/Twitter announcement by @kwindla.
That single-file agent was released when GPT-5 became publicly available; its announcement highlighted the simplicity and power of GPT-5 for voice AI applications.
```bash
# Set your OpenAI API key
export OPENAI_API_KEY=sk-proj-your-api-key-here

# Run the voice agent
uv run gpt-5-voice-agent.py
```
Note: First-time setup takes about 30 seconds to install dependencies and begin processing audio/video.
For optimal voice AI performance, use these parameter settings:

```python
service_tier: "priority"      # Doubles cost but reduces latency
reasoning_effort: "minimal"   # Faster responses for conversation
verbosity: "low"              # Concise responses for voice
```
The "priority" service tier is recommended for latency-sensitive conversational applications, though it doubles the cost per token.
The original implementation uses a three-model approach:
- GPT-5: Main conversation model
- OpenAI Whisper: Speech-to-text transcription
- OpenAI TTS: Text-to-speech generation
Further reading:
- Pipecat.ai Guide: Comprehensive starter kit covering both the three-model and Realtime API approaches
- Voice AI & Voice Agents Primer: Technical deep dive into building production voice agents
- Original Code Gist: The original single-file implementation
OpenAI also released a new natively voice-to-voice Realtime model and API. For more information about using the Realtime API alongside the three-model approach, see the Pipecat.ai documentation above.
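For orientation, here is a minimal sketch of how the three-model approach composes as a Pipecat pipeline. The import paths and constructor arguments are assumptions based on recent pipecat-ai releases and may differ from the version pinned here; the actual single-file agent also wires in a WebRTC transport, VAD, and context aggregation, all omitted below:

```python
# Sketch only: STT -> LLM -> TTS as a Pipecat pipeline (module paths assumed).
from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.openai import (
    OpenAILLMService,  # GPT-5: main conversation model
    OpenAISTTService,  # Whisper-backed speech-to-text
    OpenAITTSService,  # text-to-speech
)

stt = OpenAISTTService()               # transcribes incoming audio frames
llm = OpenAILLMService(model="gpt-5")  # generates the reply text
tts = OpenAITTSService(voice="alloy")  # speaks the reply ("alloy" is a stock voice)

# In the real agent, transport.input()/output() from the WebRTC transport
# would bracket these processors; the bare chain conveys the data flow.
pipeline = Pipeline([stt, llm, tts])
```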
- Real-time Voice Conversations: Natural voice interaction with GPT-5
- Speech-to-Text: Automatic transcription using OpenAI's Whisper
- Text-to-Speech: Natural voice responses using OpenAI's TTS
- Video Stream Analysis: Visual understanding of camera feed
- Background Process Management: Easy start/stop/status control
- PID-based Process Control: Reliable process management
```mermaid
graph TB
    subgraph "User Interface"
        A[Browser Client] --> B[Pipecat Playground]
        B --> C[WebRTC Connection]
    end

    subgraph "Voice Processing"
        C --> D[Microphone Input]
        C --> E[Camera Input]
        D --> F[Speech-to-Text<br/>OpenAI Whisper]
        E --> G[Video Analysis<br/>GPT-5 Vision]
    end

    subgraph "AI Processing"
        F --> H[Text Input]
        G --> I[Visual Context]
        H --> J[GPT-5 Processing]
        I --> J
        J --> K[Text Response]
    end

    subgraph "Output Generation"
        K --> L[Text-to-Speech<br/>OpenAI TTS]
        L --> M[Audio Output]
        M --> C
    end

    subgraph "Management"
        N[start.sh] --> O[Background Process]
        P[stop.sh] --> O
        Q[status.sh] --> O
        O --> R[app.pid]
        O --> S[app.log]
    end

    style A fill:#e1f5fe
    style B fill:#f3e5f5
    style J fill:#fff3e0
    style L fill:#e8f5e8
    style N fill:#ffebee
    style P fill:#ffebee
    style Q fill:#ffebee
```
- Python 3.12+
- uv package manager
- OpenAI API key
Option 1: Automated Installation (Recommended)

```bash
# Clone the repository
git clone https://github.com/abdshomad/gpt-5-voice-agent
cd gpt-5-voice-agent

# Run the automated installer
./install.sh
```
Option 2: Manual Installation

```bash
# Clone the repository
git clone https://github.com/abdshomad/gpt-5-voice-agent
cd gpt-5-voice-agent

# Install dependencies
uv sync

# Set up environment
cp .env.example .env
nano .env  # Add your OpenAI API key
```
Update the API key in `.env`:

```bash
OPENAI_API_KEY=sk-proj-your-actual-api-key-here
```
```bash
# Start the application
./start.sh

# Access the app
# Open your browser and go to: http://localhost:7860/client

# Check status
./status.sh

# Stop the application
./stop.sh
```
The Pipecat Playground interface showing real-time voice conversation with GPT-5, including audio visualization, video stream, and conversation logs.
`./install.sh`
- Automated installation with dependency checking
- Verifies uv and Python installation
- Installs all dependencies using `uv sync`
- Sets up the environment file from the template
- Makes scripts executable
- Provides next-steps guidance
`./start.sh`
- Starts the voice agent in the background using `uv run`
- Saves the PID to `app.pid`
- Logs output to `app.log`

`./stop.sh`
- Gracefully stops the running app
- Removes the PID file
- Frees port 7860

`./status.sh`
- Shows whether the app is running
- Displays recent logs
- Shows port status
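The PID-file pattern behind these scripts is small enough to sketch. Below is a hypothetical Python rendering of the same start/stop/status logic; the real scripts are plain shell, but the `app.pid` and `app.log` file names match:

```python
# Hypothetical sketch of the scripts' PID-file pattern (not the actual scripts).
import os
import pathlib
import signal
import subprocess

PID_FILE = pathlib.Path("app.pid")
LOG_FILE = pathlib.Path("app.log")

def start() -> None:
    with LOG_FILE.open("ab") as log:
        proc = subprocess.Popen(
            ["uv", "run", "gpt-5-voice-agent.py"],
            stdout=log,
            stderr=subprocess.STDOUT,
        )
    PID_FILE.write_text(str(proc.pid))  # remember which process to stop later

def stop() -> None:
    pid = int(PID_FILE.read_text())
    os.kill(pid, signal.SIGTERM)  # graceful shutdown; frees port 7860
    PID_FILE.unlink()

def status() -> str:
    if not PID_FILE.exists():
        return "not running"
    pid = int(PID_FILE.read_text())
    try:
        os.kill(pid, 0)  # signal 0 only tests whether the process exists
        return f"running (pid {pid})"
    except ProcessLookupError:
        return "stale app.pid (process has exited)"
```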
| Variable | Description | Required | Default |
|---|---|---|---|
| `OPENAI_API_KEY` | Your OpenAI API key | ✅ Yes | - |
| `DEBUG` | Enable debug logging | ❌ No | `false` |
| `PORT` | Custom port | ❌ No | `7860` |
The project includes a `.env.example` file as a template:

```bash
# Copy the example file
cp .env.example .env

# Edit with your actual values
nano .env
```
Required Setup:
1. Get your OpenAI API key from the OpenAI Platform
2. Replace `sk-proj-your-openai-api-key-here` with your actual API key
3. Save the file
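Since `python-dotenv` is already among the dependencies, a short hypothetical check can confirm the key is actually loaded from `.env`:

```python
# Hypothetical sanity check that .env is read correctly (uses python-dotenv).
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory
key = os.environ.get("OPENAI_API_KEY", "")
if key.startswith("sk-"):
    print("OPENAI_API_KEY loaded")
else:
    print("OPENAI_API_KEY missing or malformed; check your .env file")
```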
The app uses `pyproject.toml` for dependency management with the following packages:

```
numba==0.61.2
openai==1.99.1
python-dotenv
fastapi[all]
uvicorn
pipecat-ai[silero,webrtc,openai]
pipecat-ai-small-webrtc-prebuilt
```
Development Dependencies (optional):
- `pytest`: testing framework
- `black`: code formatting
- `flake8`: linting
- `mypy`: type checking
- Start the app: `./start.sh`
- Open browser: Navigate to http://localhost:7860/client
- Allow camera/microphone: Grant permissions when prompted
- Start talking: Begin your voice conversation with GPT-5
- Ask about video: Say "what can you see?" to analyze the camera feed
```bash
# Check that dependencies are installed
uv run python -c "import openai, pipecat, fastapi; print('✅ Dependencies ready!')"

# Test the application
uv run python gpt-5-voice-agent.py --help

# Check running status
./status.sh

# Run the automated installer (if not already run)
./install.sh
```
The agent can analyze your camera feed in real-time:
- Visual Questions: Ask "what do you see?" or "describe the video"
- Object Recognition: Identify objects in the camera view
- Scene Analysis: Understand the context of your environment
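Conceptually, each sampled camera frame reaches GPT-5 as an image input. The standalone sketch below shows the idea using the plain OpenAI SDK and a hypothetical saved frame; the agent itself routes frames through Pipecat rather than calling the API this way:

```python
# Sketch: ask GPT-5 about a single captured frame ("frame.jpg" is hypothetical).
import base64

from openai import OpenAI

client = OpenAI()

with open("frame.jpg", "rb") as f:
    frame_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-5",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What do you see in this frame?"},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{frame_b64}"},
            },
        ],
    }],
)
print(response.choices[0].message.content)
```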
```bash
# Check what's using port 7860
lsof -i :7860

# Kill the process if needed
kill <PID>
```
- Check logs: `tail -f app.log`
- Verify the API key in `.env`
- Ensure all dependencies are installed: `uv sync`
- Check microphone permissions in browser
- Ensure microphone is not muted
- Try refreshing the browser page
- Check camera permissions in browser
- Ensure camera is not in use by other applications
- Try refreshing the browser page
- Ensure the `.env` file exists and has the correct API key
- Check that `.env.example` was copied correctly
- Verify the API key format starts with `sk-proj-`
```
gpt-5-voice-agent-2025/
├── gpt-5-voice-agent.py    # Main application
├── pyproject.toml          # Project configuration and dependencies
├── install.sh              # Automated installation script
├── start.sh                # Start script (uses uv run)
├── stop.sh                 # Stop script
├── status.sh               # Status script
├── .env                    # Environment variables (create from .env.example)
├── .env.example            # Environment template
├── .gitignore              # Git ignore rules
├── INSTALL.md              # Detailed installation guide
└── README.md               # This file
```
The project uses `pyproject.toml` for modern Python packaging:
- Dependencies: All required packages are specified in `pyproject.toml`
- Development Tools: Includes configuration for testing, linting, and formatting
- Build System: Uses `hatchling` for building and packaging
- Installation: Can be installed with `pip install -e .` or `uv sync`
The project has been tested with the following dependencies:
- 94 packages installed successfully
- Core dependencies: openai, pipecat-ai, fastapi, uvicorn
- Audio processing: numba, av, aiortc, pyloudnorm
- Video processing: opencv-python, pillow
- Development tools: pytest, black, flake8, mypy (optional)
The `install.sh` script provides automated setup:
- Dependency checking: Verifies `uv` and Python installation
- Automatic installation: Uses `uv sync` for reliable dependency management
- Environment setup: Creates `.env` from the template
- Verification: Tests core dependencies and the application
- User guidance: Provides clear next steps
```bash
# Run directly with uv (recommended)
uv run gpt-5-voice-agent.py

# Or use the start script (also uses uv run)
./start.sh

# For development with auto-reload
uv run uvicorn gpt-5-voice-agent:app --reload --host 0.0.0.0 --port 7860
```
```bash
# Real-time logs
tail -f app.log

# Recent logs
tail -20 app.log
```
```bash
# If running directly: press Ctrl+C

# If running with the start script
./stop.sh
```
- API Key: Never commit your `.env` file to version control
- Environment Template: Use `.env.example` as a safe template
- Permissions: The app requires camera and microphone access
- Network: Runs locally on localhost:7860
This project is for educational and personal use. Please ensure you comply with OpenAI's usage policies.
- Fork the repository
- Create a feature branch
- Make your changes
- Test thoroughly
- Submit a pull request
If you encounter issues:
- Check the logs: `tail -f app.log`
- Verify your OpenAI API key in `.env`
- Ensure all dependencies are installed
- Check browser permissions for camera/microphone
- Verify `.env` was created from `.env.example`
Happy Voice Chatting! 🎤✨