An AI-powered smart interview cropping system that detects speakers in videos and automatically focuses on whoever is actively speaking. VoiceVision uses TalkNet-based audio-visual speaker diarization to create well-framed videos optimized for social media and presentations. It also generates transcriptions using Whisper ASR, attributes speech segments to the identified speakers, and provides a searchable transcript interface.
- Real-time speaker detection in videos using audio-visual cues
- Smart cropping that follows active speakers with smooth transitions
- Automatic portrait-mode video generation (9:16) for social media
- Video annotation with speaker information and detection confidence
- Works with multiple speakers in the same video
- Supports various video sources including m3u8 streams
- Video history sidebar for easy access to previously processed videos
- Automatic Speech Recognition (ASR) using Whisper for video transcription
- Speaker attribution in transcripts, linking spoken segments to detected speakers
- Searchable transcript interface on the results page for easy content navigation
The diagram below illustrates the system architecture, including the integration of Automatic Speech Recognition (ASR), speaker-to-transcript attribution, and the transcript search functionality.
graph TD
subgraph Frontend
A["Web Interface"] --> B["Upload Form"]
A --> C["Video URL Input"]
A --> D["Aspect Ratio Selection"]
A --> E["Video History Sidebar"]
A --> F["Results Page"]
end
subgraph Backend
G["Flask App"] --> H["Video Processing"]
G --> I["Storage Management"]
G --> J["History Management"]
subgraph "Processing Pipeline"
H --> K["Video Download"]
K --> L["Speaker Detection"]
L --> M["Face Detection"]
L --> N["Audio Analysis"]
M & N --> O["Speaker Diarization"]
O --> P["Smart Cropping"]
P --> Q["Cropped Video Generation"]
P --> R["Annotated Video Generation"]
Q & R --> S["Thumbnail Generation"]
O --> T1["ASR (Transcription)"]
T1 --> T2["Speaker Attribution to Transcript"]
end
subgraph "Storage"
I --> T["Task Directory"]
I --> U["Upload Directory"]
I --> V["History JSON"]
end
end
%% User flows
B --> |Upload Video| G
C --> |Process URL| G
D --> |Select Ratio| G
G --> |Processing Status| A
G --> |Store Results & Transcripts| T
G --> |Save to History| V
T --> |Serve Videos & Transcripts| F
V --> |Display History| E
E --> |Select Previous Video| F
%% Component details
classDef frontend fill:#d4f1f9,stroke:#05a3c7,stroke-width:2px
classDef backend fill:#ffe6cc,stroke:#f0ad4e,stroke-width:2px
classDef processing fill:#d9ead3,stroke:#6aa84f,stroke-width:2px
classDef storage fill:#fff2cc,stroke:#ffbf00,stroke-width:2px
class A,B,C,D,E,F frontend
class G,H,I,J,T1,T2 backend
%% T1, T2 are backend processing steps
class K,L,M,N,O,P,Q,R,S processing
class T,U,V storage
The application follows a clean 3-layer architecture:
- UI Layer: User interfaces for uploading, viewing results, and browsing history
- Config Layer: Configuration options for processing parameters
- Backend Layer: Core processing logic, storage, and history management (a minimal sketch follows)
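As a concrete (hypothetical) illustration of the backend layer, the minimal Flask sketch below accepts a processing request. The `/process` route, the form field names, and the `start_processing` helper are invented for illustration and are not the actual `app.py`; only the port (5001) matches the setup described later.

from flask import Flask, request, jsonify

app = Flask(__name__)

def start_processing(video_url: str, ratio: str) -> str:
    # Stub standing in for the real pipeline: create a task directory, run
    # speaker detection/cropping/transcription, and update the history JSON.
    return "example-task-id"

@app.route("/process", methods=["POST"])  # hypothetical endpoint name
def process_video():
    video_url = request.form["video_url"]
    ratio = request.form.get("aspect_ratio", "9:16")
    task_id = start_processing(video_url, ratio)
    return jsonify({"task_id": task_id, "status": "processing"})

if __name__ == "__main__":
    app.run(port=5001)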
You can also run the VoiceVision web application using Docker. The official image is available on Docker Hub.
- Docker installed on your system.
Pull the latest image from Docker Hub:
docker pull mehdih7/voicevision-app:latest
To run the VoiceVision application, use the following command:
docker run -d -p HOST_PORT:5001 --name voicevision-container mehdih7/voicevision-app:latest
Explanation:
- `-d`: Runs the container in detached mode (in the background).
- `-p HOST_PORT:5001`: Maps a port on your host machine (`HOST_PORT`) to port `5001` inside the container (where the VoiceVision Flask app listens).
  - For example, to access the application on `http://localhost:5005` on your host machine, use `-p 5005:5001`.
- `--name voicevision-container`: Assigns a recognizable name to your running container.
- `mehdih7/voicevision-app:latest`: Specifies the image to run.
Example (Accessing on host port 5005):
docker run -d -p 5005:5001 --name voicevision-container mehdih7/voicevision-app:latest
Once the container is running, open your web browser and navigate to:
http://localhost:HOST_PORT

(Replace `HOST_PORT` with the port you chose in the `docker run` command.)
The VoiceVision application creates data in the `uploads/` and `task/` directories inside the container. To persist this data if the container is removed, use Docker volumes:
docker run -d \
-p HOST_PORT:5001 \
-v /path/on/your/host/for/uploads:/app/uploads \
-v /path/on/your/host/for/tasks:/app/task \
--name voicevision-container \
mehdih7/voicevision-app:latest
Replace:
- `HOST_PORT` with your desired host port (e.g., `5005`).
- `/path/on/your/host/for/uploads` with the actual path on your computer for uploads (e.g., `$(pwd)/voicevision_data/uploads` to create it in your current project directory).
- `/path/on/your/host/for/tasks` with the actual path for task data (e.g., `$(pwd)/voicevision_data/tasks`).
Make sure these host directories exist, or Docker might create them with root ownership.
- Python 3.8+
- PyTorch
- OpenCV
- scipy
- python_speech_features
- m3u8
- ffmpeg (command line tool)
- Clone this repository:

  git clone https://github.com/simulamet-host/VoiceVision.git
  cd VoiceVision

- Install dependencies:

  pip install -r requirements.txt

- Download model weights. Models should be placed in the `weights/` directory:
  - `talknet_speaker_v1.model`: TalkNet speaker detection model
  - `s3fd_facedetection_v1.pth`: S3FD face detection model
To start the Flask web application locally:
python app.py
The application will then be accessible at `http://localhost:5001`.
The core speaker diarization logic can also be imported and used as a Python module in your own scripts:
from demo_speaker_diarization import demo_speaker_diarization
task_id = 'my-interview-task'
video_url = 'https://your-video-url.m3u8' # Can be a local file too
output_path = f'task/{task_id}/interview_cropped.mp4'
# Generate a cropped video focusing on speakers
demo_speaker_diarization(task_id, video_url, output_path,
target_ratio=(9, 16), # For portrait mode
min_score=0.4) # Speaker detection threshold
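For intuition about the cropping step, here is a minimal sketch of the idea (an assumption for illustration, not the shipped `cropper.py`): given the active speaker's horizontal face centre, compute a portrait crop window that stays inside the frame.

def crop_window(face_center_x: int, frame_w: int, frame_h: int,
                target_ratio=(9, 16)):
    # Width of a crop that fills the frame height at the target ratio
    crop_w = int(frame_h * target_ratio[0] / target_ratio[1])
    # Centre on the face, then clamp so the window never leaves the frame
    left = min(max(face_center_x - crop_w // 2, 0), frame_w - crop_w)
    return left, 0, left + crop_w, frame_h  # (x1, y1, x2, y2)

# Example: a 1920x1080 landscape frame with the speaker at x=1500
print(crop_window(1500, 1920, 1080))  # -> (1197, 0, 1804, 1080)

Smoothing the window position across frames, rather than jumping per frame, is what produces the smooth transitions described above.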
VoiceVision combines face detection, audio analysis, and multi-modal fusion to identify the active speaker in each frame:
- Video Processing: Loads and processes video from various sources
- Face Detection: Detects and tracks faces across video frames
- Audio Analysis: Extracts speech features from the audio track
- Speaker Diarization: Matches audio to visual features to identify active speakers
- Transcription: The audio track is processed by an ASR model (Whisper) to generate a time-stamped transcript.
- Speaker Attribution: The generated transcript is aligned with the speaker diarization output to assign speaker labels to each segment of speech (see the sketch after this list).
- Smart Cropping: Intelligently frames the active speaker with smooth transitions (based on speaker diarization data).
- Output Generation: Creates both a cropped video and an annotated version. The results page also allows searching through the transcript.
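As a rough illustration of the speaker-attribution step, the sketch below labels Whisper-style transcript segments with the diarization turn that overlaps them most. The data shapes and the function itself are assumptions for illustration, not VoiceVision's actual API:

def attribute_speakers(transcript_segments, speaker_turns):
    """Label each segment with the speaker whose turn overlaps it most."""
    labeled = []
    for seg in transcript_segments:
        best_speaker, best_overlap = "unknown", 0.0
        for turn_start, turn_end, speaker_id in speaker_turns:
            # Temporal overlap between the segment and the speaker turn
            overlap = min(seg["end"], turn_end) - max(seg["start"], turn_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker_id, overlap
        labeled.append({**seg, "speaker": best_speaker})
    return labeled

segments = [{"start": 0.0, "end": 2.5, "text": "Welcome to the show."}]
turns = [(0.0, 3.0, "Speaker 1"), (3.0, 6.0, "Speaker 2")]
print(attribute_speakers(segments, turns))  # "Welcome..." -> Speaker 1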
- `demo_speaker_diarization.py`: Main entry point for the module
- `cropper.py`: Smart video cropping around detected speakers
- `components/`: Neural network components
  - `face_detection/`: Face detection module using S3FD
  - `talknet_modules/`: TalkNet speaker detection implementation
  - `encoders/`: Neural network encoders for audio and video
- `weights/`: Model weights directory
- `utils.py`: Utility functions for video processing
- `task/`: Output directory for processed videos
The speaker detection is based on TalkNet, which uses both audio and visual cues to detect active speakers in videos. The architecture consists of:
- Face detection using S3FD
- Audio feature extraction with a specialized audio encoder
- Visual feature extraction from detected faces
- Multi-modal fusion using attention mechanism
- Speaker activity detection with confidence scores (sketched below)
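As a rough sketch of that final step (not TalkNet's actual code), per-face confidence scores can be smoothed over time and thresholded into a single active-speaker track; the shape of the hypothetical `scores` array and the helper below are assumptions:

import numpy as np

def active_speaker_track(scores: np.ndarray, min_score: float = 0.4,
                         window: int = 5) -> np.ndarray:
    """scores: (num_faces, num_frames) confidences; returns one face id per frame."""
    # Smooth each face's scores over a sliding window to reduce jitter
    kernel = np.ones(window) / window
    smoothed = np.apply_along_axis(
        lambda s: np.convolve(s, kernel, mode="same"), 1, scores)
    best = smoothed.argmax(axis=0)               # highest-scoring face per frame
    best[smoothed.max(axis=0) < min_score] = -1  # -1: nobody is speaking
    return best

The `min_score` threshold here plays the same role as the `min_score` parameter in the module usage example above.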
- TalkNet for the speaker diarization approach
- S3FD for the face detection model