An AI-powered smart interview cropping system that detects speakers in videos and automatically focuses on whoever is actively speaking. VoiceVision uses TalkNet-based audio-visual speaker diarization to create well-framed videos optimized for social media and presentations. It also generates transcriptions using Whisper ASR, attributes speech segments to the identified speakers, and provides a searchable transcript interface.
- Real-time speaker detection in videos using audio-visual cues
- Smart cropping that follows active speakers with smooth transitions
- Automatic portrait-mode video generation (9:16) for social media
- Video annotation with speaker information and detection confidence
- Works with multiple speakers in the same video
- Supports various video sources including m3u8 streams
- Video history sidebar for easy access to previously processed videos
- Automatic Speech Recognition (ASR) using Whisper for video transcription
- Speaker attribution in transcripts, linking spoken segments to detected speakers
- Searchable transcript interface on the results page for easy content navigation
The diagram below illustrates the system architecture, including the integration of Automatic Speech Recognition (ASR), speaker-to-transcript attribution, and the transcript search functionality.
graph TD
subgraph Frontend
A["Web Interface"] --> B["Upload Form"]
A --> C["Video URL Input"]
A --> D["Aspect Ratio Selection"]
A --> E["Video History Sidebar"]
A --> F["Results Page"]
end
subgraph Backend
G["Flask App"] --> H["Video Processing"]
G --> I["Storage Management"]
G --> J["History Management"]
subgraph "Processing Pipeline"
H --> K["Video Download"]
K --> L["Speaker Detection"]
L --> M["Face Detection"]
L --> N["Audio Analysis"]
M & N --> O["Speaker Diarization"]
O --> P["Smart Cropping"]
P --> Q["Cropped Video Generation"]
P --> R["Annotated Video Generation"]
Q & R --> S["Thumbnail Generation"]
O --> T1["ASR (Transcription)"]
T1 --> T2["Speaker Attribution to Transcript"]
end
subgraph "Storage"
I --> T["Task Directory"]
I --> U["Upload Directory"]
I --> V["History JSON"]
end
end
%% User flows
B --> |Upload Video| G
C --> |Process URL| G
D --> |Select Ratio| G
G --> |Processing Status| A
G --> |Store Results & Transcripts| T
G --> |Save to History| V
T --> |Serve Videos & Transcripts| F
V --> |Display History| E
E --> |Select Previous Video| F
%% Component details
classDef frontend fill:#d4f1f9,stroke:#05a3c7,stroke-width:2px
classDef backend fill:#ffe6cc,stroke:#f0ad4e,stroke-width:2px
classDef processing fill:#d9ead3,stroke:#6aa84f,stroke-width:2px
classDef storage fill:#fff2cc,stroke:#ffbf00,stroke-width:2px
class A,B,C,D,E,F frontend
class G,H,I,J,T1,T2 backend
%% T1, T2 are backend processing steps
class K,L,M,N,O,P,Q,R,S processing
class T,U,V storage
The application follows a clean 3-layer architecture:
- UI Layer: User interfaces for uploading, viewing results, and browsing history
- Config Layer: Configuration options for processing parameters
- Backend Layer: Core processing logic, storage, and history management (a minimal sketch follows)
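As a concrete (hypothetical) illustration of the backend layer, the minimal Flask sketch below accepts a processing request. The `/process` route, the form field names, and the `start_processing` helper are invented for illustration and are not the actual `app.py`; only the port (5001) matches the setup described later.

from flask import Flask, request, jsonify

app = Flask(__name__)

def start_processing(video_url: str, ratio: str) -> str:
    # Stub standing in for the real pipeline: create a task directory, run
    # speaker detection/cropping/transcription, and update the history JSON.
    return "example-task-id"

@app.route("/process", methods=["POST"])  # hypothetical endpoint name
def process_video():
    video_url = request.form["video_url"]
    ratio = request.form.get("aspect_ratio", "9:16")
    task_id = start_processing(video_url, ratio)
    return jsonify({"task_id": task_id, "status": "processing"})

if __name__ == "__main__":
    app.run(port=5001)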
You can also run the VoiceVision web application using Docker. The official image is available on Docker Hub.
- Docker installed on your system.
Pull the latest image from Docker Hub:
docker pull mehdih7/voicevision-app:latest
To run the VoiceVision application, use the following command:
docker run -d -p HOST_PORT:5001 --name voicevision-container mehdih7/voicevision-app:latest
Explanation:
- `-d`: Runs the container in detached mode (in the background).
- `-p HOST_PORT:5001`: Maps a port on your host machine (`HOST_PORT`) to port `5001` inside the container (where the VoiceVision Flask app listens).
  - For example, to access the application on `http://localhost:5005` on your host machine, use `-p 5005:5001`.
- `--name voicevision-container`: Assigns a recognizable name to your running container.
- `mehdih7/voicevision-app:latest`: Specifies the image to run.
Example (Accessing on host port 5005):
docker run -d -p 5005:5001 --name voicevision-container mehdih7/voicevision-app:latest
Once the container is running, open your web browser and navigate to:
http://localhost:HOST_PORT

(Replace `HOST_PORT` with the port you chose in the `docker run` command.)
The VoiceVision application creates data in the `uploads/` and `task/` directories inside the container. To persist this data if the container is removed, use Docker volumes:
docker run -d \
-p HOST_PORT:5001 \
-v /path/on/your/host/for/uploads:/app/uploads \
-v /path/on/your/host/for/tasks:/app/task \
--name voicevision-container \
mehdih7/voicevision-app:latest
Replace:
- `HOST_PORT` with your desired host port (e.g., `5005`).
- `/path/on/your/host/for/uploads` with the actual path on your computer for uploads (e.g., `$(pwd)/voicevision_data/uploads` to create it in your current project directory).
- `/path/on/your/host/for/tasks` with the actual path for task data (e.g., `$(pwd)/voicevision_data/tasks`).
Make sure these host directories exist, or Docker might create them with root ownership.
- Python 3.8+
- PyTorch
- OpenCV
- scipy
- python_speech_features
- m3u8
- ffmpeg (command line tool)
- Clone this repository:

  git clone https://github.com/simulamet-host/VoiceVision.git
  cd VoiceVision

- Install dependencies:

  pip install -r requirements.txt

- Download model weights. Models should be placed in the `weights/` directory:
  - `talknet_speaker_v1.model`: TalkNet speaker detection model
  - `s3fd_facedetection_v1.pth`: S3FD face detection model
To start the Flask web application locally:
python app.py
The application will then be accessible at `http://localhost:5001`.
The core speaker diarization logic can also be imported and used as a Python module in your own scripts:
from demo_speaker_diarization import demo_speaker_diarization
task_id = 'my-interview-task'
video_url = 'https://your-video-url.m3u8' # Can be a local file too
output_path = f'task/{task_id}/interview_cropped.mp4'
# Generate a cropped video focusing on speakers
demo_speaker_diarization(task_id, video_url, output_path,
target_ratio=(9, 16), # For portrait mode
min_score=0.4) # Speaker detection threshold
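For intuition about the cropping step, here is a minimal sketch of the idea (an assumption for illustration, not the shipped `cropper.py`): given the active speaker's horizontal face centre, compute a portrait crop window that stays inside the frame.

def crop_window(face_center_x: int, frame_w: int, frame_h: int,
                target_ratio=(9, 16)):
    # Width of a crop that fills the frame height at the target ratio
    crop_w = int(frame_h * target_ratio[0] / target_ratio[1])
    # Centre on the face, then clamp so the window never leaves the frame
    left = min(max(face_center_x - crop_w // 2, 0), frame_w - crop_w)
    return left, 0, left + crop_w, frame_h  # (x1, y1, x2, y2)

# Example: a 1920x1080 landscape frame with the speaker at x=1500
print(crop_window(1500, 1920, 1080))  # -> (1197, 0, 1804, 1080)

Smoothing the window position across frames, rather than jumping per frame, is what produces the smooth transitions described above.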
VoiceVision combines face detection, audio analysis, and multi-modal fusion to identify the active speaker in each frame:
- Video Processing: Loads and processes video from various sources
- Face Detection: Detects and tracks faces across video frames
- Audio Analysis: Extracts speech features from the audio track
- Speaker Diarization: Matches audio to visual features to identify active speakers
- Transcription: The audio track is processed by an ASR model (Whisper) to generate a time-stamped transcript.
- Speaker Attribution: The generated transcript is aligned with the speaker diarization output to assign speaker labels to each segment of speech (see the sketch after this list).
- Smart Cropping: Intelligently frames the active speaker with smooth transitions (based on speaker diarization data).
- Output Generation: Creates both a cropped video and an annotated version. The results page also allows searching through the transcript.
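As a rough illustration of the speaker-attribution step, the sketch below labels Whisper-style transcript segments with the diarization turn that overlaps them most. The data shapes and the function itself are assumptions for illustration, not VoiceVision's actual API:

def attribute_speakers(transcript_segments, speaker_turns):
    """Label each segment with the speaker whose turn overlaps it most."""
    labeled = []
    for seg in transcript_segments:
        best_speaker, best_overlap = "unknown", 0.0
        for turn_start, turn_end, speaker_id in speaker_turns:
            # Temporal overlap between the segment and the speaker turn
            overlap = min(seg["end"], turn_end) - max(seg["start"], turn_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker_id, overlap
        labeled.append({**seg, "speaker": best_speaker})
    return labeled

segments = [{"start": 0.0, "end": 2.5, "text": "Welcome to the show."}]
turns = [(0.0, 3.0, "Speaker 1"), (3.0, 6.0, "Speaker 2")]
print(attribute_speakers(segments, turns))  # "Welcome..." -> Speaker 1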
- `demo_speaker_diarization.py`: Main entry point for the module
- `cropper.py`: Smart video cropping around detected speakers
- `components/`: Neural network components
  - `face_detection/`: Face detection module using S3FD
  - `talknet_modules/`: TalkNet speaker detection implementation
  - `encoders/`: Neural network encoders for audio and video
- `weights/`: Model weights directory
- `utils.py`: Utility functions for video processing
- `task/`: Output directory for processed videos
The speaker detection is based on TalkNet, which uses both audio and visual cues to detect active speakers in videos. The architecture consists of:
- Face detection using S3FD
- Audio feature extraction with a specialized audio encoder
- Visual feature extraction from detected faces
- Multi-modal fusion using attention mechanism
- Speaker activity detection with confidence scores (sketched below)
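As a rough sketch of that final step (not TalkNet's actual code), per-face confidence scores can be smoothed over time and thresholded into a single active-speaker track; the shape of the hypothetical `scores` array and the helper below are assumptions:

import numpy as np

def active_speaker_track(scores: np.ndarray, min_score: float = 0.4,
                         window: int = 5) -> np.ndarray:
    """scores: (num_faces, num_frames) confidences; returns one face id per frame."""
    # Smooth each face's scores over a sliding window to reduce jitter
    kernel = np.ones(window) / window
    smoothed = np.apply_along_axis(
        lambda s: np.convolve(s, kernel, mode="same"), 1, scores)
    best = smoothed.argmax(axis=0)               # highest-scoring face per frame
    best[smoothed.max(axis=0) < min_score] = -1  # -1: nobody is speaking
    return best

The `min_score` threshold here plays the same role as the `min_score` parameter in the module usage example above.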
- TalkNet for the speaker diarization approach
- S3FD for the face detection model