Granary large-scale speech processing pipeline #155

Merged: 36 commits, Aug 9, 2025
Commits (36)
89f3639
Added partial configs
ssh-meister Aug 7, 2025
af29b15
Skip nemo_run_config.yaml in no-nemo-tests
ssh-meister Aug 7, 2025
0db471d
Merge remote-tracking branch 'origin/main' into Granary
ssh-meister Aug 7, 2025
7b390cb
Fixed missed yaml module
ssh-meister Aug 7, 2025
555eddf
CountNumWords without alphabet
ssh-meister Aug 7, 2025
c8ae558
Fix TypeError: cannot pickle 'fasttext_pybind.fasttext' object
ssh-meister Aug 7, 2025
e39705b
Added missed methods for manifest reading and fixed gpu amount setup …
ssh-meister Aug 7, 2025
0cedf9e
Fix fasttext test
ssh-meister Aug 7, 2025
0870c36
Fix typo
ssh-meister Aug 7, 2025
3560d5b
Config and test
ssh-meister Aug 7, 2025
e18dba5
Fix docs
ssh-meister Aug 7, 2025
6e02814
Add try/except for s3.download_file
ssh-meister Aug 7, 2025
ffe8386
Docs
ssh-meister Aug 7, 2025
1c658dd
README
ssh-meister Aug 8, 2025
211e909
removed empty lines
ssh-meister Aug 8, 2025
7b88bf1
Update README
ssh-meister Aug 8, 2025
2064127
Update README
ssh-meister Aug 8, 2025
4c6dfeb
Test tests s3
ssh-meister Aug 8, 2025
8895617
to_abs_paths
ssh-meister Aug 8, 2025
d209d06
to_abs_paths
ssh-meister Aug 8, 2025
bae5f7e
to_abs_paths
ssh-meister Aug 8, 2025
65c930d
Removed wrong docker
ssh-meister Aug 8, 2025
2a2ada3
Tests fixed
Aug 8, 2025
4818eb1
New test in workflow
ssh-meister Aug 8, 2025
6557cdb
try except for test data to data when s3 download
ssh-meister Aug 8, 2025
48a7f97
Missed dependences added
ssh-meister Aug 8, 2025
c5cea28
Missed dependences added
ssh-meister Aug 8, 2025
6852f89
Turn on all tests
ssh-meister Aug 8, 2025
ac4370d
fix tests
ssh-meister Aug 8, 2025
ae2209d
Fix AWS key check
ssh-meister Aug 8, 2025
fc6b9ac
Fix AWS key check
ssh-meister Aug 8, 2025
6867c7d
Update tests.yml
ssh-meister Aug 8, 2025
44569ef
Remove overengeneered testing
ssh-meister Aug 8, 2025
4e16a67
Merge branch 'Granary' of https://github.com/NVIDIA/NeMo-speech-data-…
ssh-meister Aug 8, 2025
bcc3bf0
removed granary workflow
ssh-meister Aug 8, 2025
2796f6d
Moved Granary to cfg_end_to_end tests
ssh-meister Aug 9, 2025
10 changes: 8 additions & 2 deletions .github/workflows/tests.yml
@@ -74,8 +74,8 @@ jobs:
pip install Cython wheel # need to pre-install to avoid error in nemo installation
pip install nemo-toolkit[asr,nlp]==2.2.1
pip install nemo_text_processing
pip install pymarian
pip install -r requirements/huggingface.txt
pip install pymarian
pip install certifi #this needed to avoid problems with certificates [COORAL]
export SSL_CERT_FILE=$(python -m certifi)
python -m pip cache purge
@@ -93,7 +93,13 @@ jobs:
sudo cp incommon-rsa-ca2.pem /usr/local/share/ca-certificates/incommon-rsa-server-ca-2.crt # [cert for CORAL]
sudo update-ca-certificates # [cert for CORAL]
set -o pipefail # this will make sure next line returns non-0 exit code if tests fail
python -m pytest tests/ --junitxml=pytest.xml --ignore=tests/test_tts_sdp_end_to_end.py --cov-report=term-missing:skip-covered --cov=sdp --durations=30 -rs | tee pytest-coverage.txt
python -m pytest tests/ \
--junitxml=pytest.xml \
--ignore=tests/test_tts_sdp_end_to_end.py \
--cov-report=term-missing:skip-covered \
--cov=sdp \
--durations=30 \
-rs | tee pytest-coverage.txt


# TODO: add some way to see if e2e tests were skipped
166 changes: 166 additions & 0 deletions dataset_configs/multilingual/granary/README.md
@@ -0,0 +1,166 @@
## Granary Dataset Creation Pipeline

### Overview

This configuration drives the **Granary pseudo-labelling pipeline** – an open-source workflow that transforms large, noisy speech corpora into high-quality Automatic Speech Recognition (ASR) and Automatic Speech Translation (AST) training data for **25 European languages**.

The first public release of **Granary** (≈ 643 k h ASR / ≈ 351 k h AST) was built from three openly available corpora:

- [espnet/yodas2](https://huggingface.co/datasets/espnet/yodas2)
- [FBK-MT/mosel](https://huggingface.co/datasets/FBK-MT/mosel)
- [PleIAs/YouTube-Commons](https://huggingface.co/datasets/PleIAs/YouTube-Commons)

and is published as [nvidia/Granary](https://huggingface.co/datasets/nvidia/Granary).

> Note — Per-language runs
>
> The pipeline is executed once per language pair: set
> - `source_lang` / `source_lang_full` – audio & transcript language
> - `translation.target_lang` / `target_lang_full` – translation language
>
> For example, to obtain English audio with Italian translations choose `source_lang: en` and `translation.target_lang: it`. Separate runs are required for each additional language combination.

> Note — GPU required
>
> All Whisper, vLLM and Comet-QE stages expect at least one CUDA-capable GPU. Multi-GPU nodes are auto-detected when `num_devices: -1` (default) is used.

### Software prerequisites

Install NeMo-speech-data-processor plus the extra wheels required by specific processors (a quick import check follows the list below):

- `FasterWhisperInference`

```bash
pip install pytorch-lightning \
    "nvidia-cublas-cu12" \
    "nvidia-cudnn-cu12==9.*" \
    faster_whisper

export LD_LIBRARY_PATH=$(python - <<'PY'
import os, nvidia.cublas.lib, nvidia.cudnn.lib
print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))
PY
)
```

- `vLLMInference`

```bash
pip install "optree>=0.13.0" vllm
```

- `CometoidWMTQualityEstimation`

```bash
pip install pymarian
```

- `FastTextLangIdClassifier`

```bash
pip install fasttext
```

- `ConvertToTarredAudioDataset` (optional, only if tar-sharding is enabled)

```bash
pip install lhotse "nemo-toolkit[common]==2.2.1"
```
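
Once the extras are installed, a quick import check helps catch a missing wheel before launching a long run. This is an optional, illustrative sketch that only assumes the packages listed above:

```bash
# Hedged sanity check: confirm the optional processor dependencies import cleanly.
python -c "from faster_whisper import WhisperModel; import vllm, pymarian, fasttext; print('optional dependencies OK')"
```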

### Quick start

1. **Hardware** – Linux box with NVIDIA GPU(s) and ≥ 16 GB VRAM (reference runs used A100-80 GB; smaller cards work with reduced batch sizes).
2. **Install** NeMo-speech-data-processor and the extras listed above.
3. **Prepare** the input manifest and set three mandatory YAML keys:
- `input_manifest_file` – manifest with raw audio paths
- `output_dir` – working/output directory
- `sdp_dir` – root of the SDP tree (for prompt/regex assets)
4. **Run the pipeline**:

```bash
# Path to your local clone of NeMo-speech-data-processor
SDP_DIR=/path/to/NeMo-speech-data-processor

python ${SDP_DIR}/main.py \
--config-path ${SDP_DIR}/dataset_configs/multilingual/granary/ \
--config-name config.yaml \
input_manifest_file=/path/to/input_manifest.json \
output_dir=/path/to/output/dir \
sdp_dir=${SDP_DIR}
```

### Input and output formats

#### Input manifest

Each line is a JSON object with the source-audio path:

```json
{"source_audio_filepath": "/path/to/file.flac"}
```
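
If the raw audio already sits under a single directory, a minimal manifest can be generated with a short shell loop. This is a sketch: `/data/audio` and the output path are placeholders, and it assumes file paths need no JSON escaping:

```bash
# Write one {"source_audio_filepath": ...} JSON object per FLAC file found (hypothetical paths).
find /data/audio -type f -name '*.flac' | while read -r f; do
  printf '{"source_audio_filepath": "%s"}\n' "$f"
done > /path/to/input_manifest.json
```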

#### Key outputs

- `${output_dir}/${source_lang}/manifest_46.json` – final bilingual manifest containing `audio_filepath`, `offset`, `duration`, `text` (source) and `answer` (translation), plus constant decoder flags.
- `${output_dir}/${source_lang}/tarred_dataset/` – optional tarred-audio shards and `shard_manifest.json` when `convert_to_audio_tarred_dataset.should_run: True`.
- All intermediate `manifest_XX.json` files are kept for auditing and debugging; a quick inspection sketch follows.
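
To spot-check the result, the final manifest can be inspected directly. The paths below are illustrative and assume `source_lang: en` together with the `output_dir` used in the Quick start:

```bash
# Pretty-print the first record of the final bilingual manifest and count the kept utterances.
head -n 1 /path/to/output/dir/en/manifest_46.json | python -m json.tool
wc -l < /path/to/output/dir/en/manifest_46.json
```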

### Pipeline stages

The processors executed (indices match the config):

- **FfmpegConvert** (0) – re-encode audio to 16 kHz/mono FLAC.
- **GetAudioDuration** (1) – compute clip length.
- **RemoveFiles** (2) – optionally delete originals (`params.save_disk_space`).
- **FasterWhisperInference** (3) – first-pass language identification.
- **LambdaExpression** (4) – probability-based LID filtering.
- **DropSpecifiedFields** (5) – remove temporary fields.
- **FasterWhisperInference** (6, 14) – two-pass transcription (second run can slice by offset).
- **Segmentation & grooming** (7–13) – split Whisper segments into atomic utterances.
- **Hallucination detection** (18–20) – drop repeated n-grams, garbage tokens and common filler phrases.
- **PnC restoration** (21–23) – `Qwen-2.5-7B` restores punctuation & capitalisation; optional regex clean-up.
- **Length & charset filtering** (27–36) – word-ratio, character histogram and FastText checks.
- **Quality estimation** (41–43) – keep pairs whose Comet-QE score is ≥ `min_qe_score`.
- **Constant flags** (44) – add decoder directives (`<|emo:undefined|>`, `itn`, `pnc`, etc.).
- **Tarred dataset** (46) – shard audio into `num_shards` tar files (optional).

### Tunable parameters

All knobs live under the `params` block; a command-line override sketch follows the list below.

- **Language**
  - `source_lang` / `source_lang_full`
  - `translation.target_lang` / `target_lang_full`

- **Audio duration**
  - `min_audio_duration` – drop very short clips (seconds)
  - `max_audio_duration` – drop very long clips (seconds)

- **Language-ID & text filtering**
  - `min_audio_lid_probability` – Whisper LID threshold
  - `translation.min_hist_token_ratio` – charset-purity ratio
  - `translation.min_text_lid_probability` – FastText LID threshold

- **Length & quality**
  - `translation.max_len_diff_ratio` – maximum source-to-target word-count ratio
  - `translation.min_qe_score` – Comet-QE acceptance score

- **Tarred dataset**
  - `convert_to_audio_tarred_dataset.should_run` (bool)
  - `num_shards` and `buckets_num` – shard layout

- **Misc.**
  - `use_regex` – regex preset for text normalisation
  - `save_disk_space` – delete originals after conversion
  - `use_dask` – enable distributed execution (not recommended)
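
As a worked example, these knobs (together with the per-language settings from the note near the top, including the corresponding `*_full` names) can be overridden on the command line with the same Hydra-style `key=value` syntax used in the Quick start. The key paths and values below are illustrative assumptions based on the `params` layout described in this section:

```bash
# Sketch: English audio with Italian translations and slightly stricter filtering (values are illustrative).
python ${SDP_DIR}/main.py \
    --config-path ${SDP_DIR}/dataset_configs/multilingual/granary/ \
    --config-name config.yaml \
    input_manifest_file=/path/to/input_manifest.json \
    output_dir=/path/to/output/dir \
    sdp_dir=${SDP_DIR} \
    params.source_lang=en \
    params.translation.target_lang=it \
    params.min_audio_duration=1.0 \
    params.max_audio_duration=40.0 \
    params.translation.min_qe_score=0.75
```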

### Advanced usage

- **Selective execution** – override `processors_to_run` with a range of indices, e.g. `"0:25"` (see the sketch after this list).
- **Model swapping** – every inference processor exposes either `model_size_or_path` (Whisper) or an embedded `model:` block (vLLM).
- **Resource tuning** – `num_devices = -1` uses all visible GPUs; set an integer to pin workers per stage.
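
For instance, to restrict a run to the index range `"0:25"` on a single pinned GPU (a sketch; `params.num_devices` as the key path for the GPU count is an assumption):

```bash
# Sketch: run only a range of processors and pin the run to one GPU (key paths are assumptions).
python ${SDP_DIR}/main.py \
    --config-path ${SDP_DIR}/dataset_configs/multilingual/granary/ \
    --config-name config.yaml \
    input_manifest_file=/path/to/input_manifest.json \
    output_dir=/path/to/output/dir \
    sdp_dir=${SDP_DIR} \
    processors_to_run="0:25" \
    params.num_devices=1
```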

### References

- Koluguri et al. (2025). Granary: Speech Recognition and Translation Dataset in 25 European Languages (preprint). arXiv: [2505.13404](https://arxiv.org/abs/2505.13404).
- [nvidia/Granary](https://huggingface.co/datasets/nvidia/Granary) dataset on Hugging Face.
- NeMo-SDP Granary config [source code](https://github.com/NVIDIA/NeMo-speech-data-processor/blob/main/dataset_configs/multilingual/granary/).