This repository fine-tunes the Voxtral speech model on conversational speech datasets using the Hugging Face transformers and datasets libraries.
```bash
git clone https://github.com/Deep-unlearning/Finetune-Voxtral-ASR.git
cd Finetune-Voxtral-ASR
```

Choose your preferred package manager:
📦 Using UV (recommended)
```bash
uv venv .venv --python 3.10 && source .venv/bin/activate
uv pip install -r requirements.txt
```

🐍 Using pip
```bash
python3.10 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```

Dataset Preparation

The training script loads `hf-audio/esb-datasets-test-only-sorted` with the `voxpopuli` config, casts the audio column to 16 kHz, and keeps only a small train/eval slice. The `VoxtralDataCollator` implements the Voxtral/LLaMA-style prompt + label masking described below.
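A minimal sketch of that loading step with the `datasets` library; the split name and slice sizes here are assumptions, so check `train.py` for the exact values:

```python
from datasets import load_dataset, Audio

# Load the VoxPopuli config of the ESB test-only benchmark and resample audio to 16 kHz.
ds = load_dataset("hf-audio/esb-datasets-test-only-sorted", "voxpopuli", split="test")
ds = ds.cast_column("audio", Audio(sampling_rate=16000))

# Carve out small train/eval slices (sizes are illustrative).
train_ds = ds.select(range(100))
eval_ds = ds.select(range(100, 120))
```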
For ASR fine-tuning, each example looks like:

- Inputs: `[AUDIO] … [AUDIO] <transcribe> <reference transcription>`
- Labels: the same sequence, but the prefix `[AUDIO] … [AUDIO] <transcribe>` is masked with `-100` so the loss is computed only on the transcription tokens.
The VoxtralDataCollator already builds this sequence (prompt expansion via the processor and label masking).
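The masking itself amounts to copying the input ids and overwriting the prompt prefix with `-100`, the index ignored by PyTorch's cross-entropy loss. A minimal sketch of the idea, not the actual collator code (`prompt_len` stands for whatever length the expanded audio prompt ends up being):

```python
import torch

def build_labels(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    # Labels start as a copy of the input ids...
    labels = input_ids.clone()
    # ...then the audio/prompt prefix is hidden from the loss,
    # so only the transcription tokens contribute.
    labels[:prompt_len] = -100
    return labels
```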
The dataset only needs two fields:
```
{
    "audio": {"array": <float32 numpy array>, "sampling_rate": 16000, ...},
    "text": "<reference transcription>"
}
```

If you want to swap in a different dataset, ensure that after loading you still have:
- an `audio` column (cast to `Audio(sampling_rate=16000)`), and
- a `text` column (the reference transcription).
If your dataset uses different column names, map them to `audio` and `text` before returning, as in the sketch below.
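For example, a hedged sketch of adapting a dataset whose columns are named differently; the dataset id and original column names here are placeholders:

```python
from datasets import load_dataset, Audio

# Hypothetical dataset with "recording"/"sentence" columns instead of "audio"/"text".
ds = load_dataset("your-username/your-asr-dataset", split="train")
ds = ds.rename_column("recording", "audio")
ds = ds.rename_column("sentence", "text")
ds = ds.cast_column("audio", Audio(sampling_rate=16000))
```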
Run the training script:
```bash
uv run train.py
```

Logs and checkpoints will be saved under the `outputs/` directory by default.
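The output location and the rest of the run configuration live inside `train.py`. As a rough idea of what such a setup typically looks like with `transformers`' `TrainingArguments` (values here are illustrative, not the script's actual settings):

```python
from transformers import TrainingArguments

# Illustrative values only; the real configuration is defined in train.py.
training_args = TrainingArguments(
    output_dir="outputs",            # where logs and checkpoints are written
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,
    num_train_epochs=1,
    logging_steps=10,
    save_steps=100,
)
```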
You can also run the training script with LoRA; a sketch of a typical LoRA configuration is shown below, followed by the launch command.
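A minimal sketch of attaching LoRA adapters with the `peft` library, assuming a `transformers` release with Voxtral support; the checkpoint name, hyperparameters, and target modules are illustrative, so check `train_lora.py` for the actual setup:

```python
from peft import LoraConfig, get_peft_model
from transformers import VoxtralForConditionalGeneration

# Illustrative checkpoint; train_lora.py may load a different one.
model = VoxtralForConditionalGeneration.from_pretrained("mistralai/Voxtral-Mini-3B-2507")

lora_config = LoraConfig(
    r=16,                              # rank of the low-rank update
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections (illustrative)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # confirm only a small fraction of weights are trainable
```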
```bash
uv run train_lora.py
```

Happy fine-tuning Voxtral! 🚀