This repository accompanies the Swiss Parliament Corpus (SPC-R) dataset and the paper "SPC-R: High-Precision Speech Corrections for Swiss Parliament Data". It provides the scripts used to generate RAG-augmented correction prompts, submit OpenAI Batch jobs, and consolidate the returned corrections and quality judgements. Only the code is shared here; the raw data is not part of this repository.
The idea is to correct weakly labeled transcriptions of parliamentary discussions with an LLM that receives, as context, semantically relevant chunks retrieved from a manually generated summary of the session.
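As a rough illustration of that retrieval step (a minimal sketch, not the code in `utils.py`): the summary chunks whose embeddings are most similar to the current segment are the ones prepended to the correction prompt.

```python
import numpy as np

# Minimal sketch of the retrieval idea (assumed; the real chunking and
# retrieval live in utils.py): rank summary chunks by cosine similarity
# to the segment embedding and keep the top k as prompt context.
def top_k_chunks(segment_vec: np.ndarray, chunk_vecs: np.ndarray,
                 chunks: list[str], k: int = 3) -> list[str]:
    sims = chunk_vecs @ segment_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(segment_vec)
    )
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]
```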
- Python 3.10 or newer.
- `uv` for dependency management.
- An OpenAI API key with access to the `gpt-4o` family and the Batch API. Export it before running any command:

```bash
export OPENAI_API_KEY="sk-..."
```
Install the dependencies via uv once:
```bash
uv sync
```

uv will create an isolated environment and install the packages defined in `pyproject.toml`. Use `uv run ...` for every command below so the managed environment is reused.
| Script | Purpose |
|---|---|
| `correct_transcriptions.py` | Generates correction or judgement prompts, optionally writes OpenAI Batch JSONL jobs, or runs the correction + judgement pipeline synchronously. |
| `submit_correction_job.py` / `submit_judgement_job.py` | Upload batch JSONL files to OpenAI and start 24h batch runs. |
| `retrieve_correction_results.py` / `retrieve_judgement_results.py` | Poll batch jobs until completion and download the answer JSONL files. |
| `add_correction_to_file.py` | Merges correction batch answers back into the original transcription JSON files. |
| `add_judgement_to_file.py` | Merges judgement batch answers back into the already corrected JSON files. |
| `batch_utils.py` | Thin OpenAI Batch helper functions shared by the CLI utilities. |
| `utils.py` | PDF extraction, embedding cache management, and prompt construction helpers. |
The workflow operates in two phases: correction and judgement. Both can run synchronously (direct API calls) or asynchronously via batches. The batch route scales better and matches the experiments in the paper.
UPDATE: We strongly recommend working with supervised fine-tuned models, especially for the quality scoring (Step 5).
```bash
uv run correct_transcriptions.py \
    --folder data/spc_r/kanton_be_grosser_rat/2018_06 \
    --pdf data/spc_r/kanton_be_grosser_rat/2018_06/tagblatt.pdf \
    --batch --step correction \
    --correction_model gpt-4o \
    --temperature 0.1
```

- `--folder` processes every JSON file inside the folder (excluding already corrected ones).
- The script extracts text from the provided summary PDF, chunks it, builds (or loads) cached embeddings in `embeddings/`, and writes `*_batch.jsonl` next to each JSON file (the request shape is sketched below).
- For a single file, use `--file /path/to/transcript.json` instead of `--folder`.
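For reference, each line of a generated `*_batch.jsonl` follows the OpenAI Batch request schema. The sketch below shows the assumed shape of one request; the actual prompt text is produced by `utils.py` and will differ.

```python
import json

# Assumed shape of one request line in *_batch.jsonl (OpenAI Batch schema);
# the prompt strings here are placeholders, not the real prompts.
request = {
    "custom_id": "20180606_03.json__12__correction",  # <basename>.json__<segment_id>__<step>
    "method": "POST",
    "url": "/v1/chat/completions",
    "body": {
        "model": "gpt-4o",
        "temperature": 0.1,
        "messages": [
            {"role": "system", "content": "Correct the transcript segment."},
            {"role": "user", "content": "<retrieved summary chunks + raw segment text>"},
        ],
    },
}
print(json.dumps(request, ensure_ascii=False))  # one line per segment
```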
```bash
uv run submit_correction_job.py --jsonl_folder data/spc_r/kanton_be_grosser_rat/2018_06
```

- Uploads every `*_batch.jsonl` to OpenAI and starts a batch.
- Writes `<file>.batch_id` next to the JSONL so you can poll later; the underlying API calls are sketched below.
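Under the hood this amounts to a file upload plus a batch creation call; a minimal sketch per JSONL file (assumed; `batch_utils.py` is authoritative):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Minimal sketch of one submission (assumed; see batch_utils.py for the real logic).
batch_file = client.files.create(
    file=open("20180606_03_batch.jsonl", "rb"), purpose="batch"
)
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # the 24h batch runs mentioned above
)
print(batch.id)  # stored next to the JSONL as <file>.batch_id
```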
```bash
uv run retrieve_correction_results.py --batch_job_id_folder data/spc_r/kanton_be_grosser_rat/2018_06
```

- Waits for each recorded batch id to complete (see the polling sketch below) and writes `*_batch_answer.jsonl` files next to the submissions.
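Polling boils down to checking the batch status until it is terminal; a hedged sketch of that loop (the real retry and backoff logic lives in `batch_utils.py`):

```python
import time
from openai import OpenAI

client = OpenAI()

# Sketch of the polling loop (assumed); batch_id comes from <file>.batch_id.
def wait_for_batch(batch_id: str, poll_seconds: int = 60) -> str:
    while True:
        batch = client.batches.retrieve(batch_id)
        if batch.status == "completed":
            # Returned text is written verbatim as *_batch_answer.jsonl.
            return client.files.content(batch.output_file_id).text
        if batch.status in ("failed", "expired", "cancelled"):
            raise RuntimeError(f"batch {batch_id} ended with status {batch.status}")
        time.sleep(poll_seconds)
```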
```bash
uv run add_correction_to_file.py --folder_original_files data/spc_r/kanton_be_grosser_rat/2018_06
```

- Reads the original JSON, matches answers by `custom_id`, and writes `*_corrected.json` containing `segments[<id>]["corrected_text"]` (a simplified merge sketch follows).
- These corrected files are the inputs for the judgement stage.
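The merge itself is a `custom_id` join; a simplified sketch of what `add_correction_to_file.py` does (assumed; segment indexing details may differ in the actual script):

```python
import json
from pathlib import Path

# Simplified merge sketch (assumed; add_correction_to_file.py is authoritative).
def merge_corrections(original: Path, answers: Path) -> None:
    data = json.loads(original.read_text(encoding="utf-8"))
    for line in answers.read_text(encoding="utf-8").splitlines():
        answer = json.loads(line)
        # custom_id is <basename>.json__<segment_id>__<step>
        _, segment_id, _step = answer["custom_id"].rsplit("__", 2)
        content = answer["response"]["body"]["choices"][0]["message"]["content"]
        # segments assumed keyed by id string; the real script may index a list
        data["segments"][segment_id]["corrected_text"] = content
    out = original.with_name(original.stem + "_corrected.json")
    out.write_text(json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8")
```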
Repeat the same four steps with --step judgement and the judgement scripts:
- Generate JSONL requests for the already corrected files:

  ```bash
  uv run correct_transcriptions.py \
      --folder data/spc_r/kanton_be_grosser_rat/2018_06 \
      --pdf data/spc_r/kanton_be_grosser_rat/2018_06/tagblatt.pdf \
      --batch --step judgement \
      --judge_model gpt-4o-mini \
      --temperature 0.1
  ```

- Submit with `uv run submit_judgement_job.py ...` (note the default glob `*_corrected_batch.jsonl`).
- Retrieve results with `uv run retrieve_judgement_results.py ...`.
- Merge answers into `*_corrected_judged.json` via `uv run add_judgement_to_file.py --folder_original_files ...`.
The resulting JSON files contain both `corrected_text` and judgement metadata for each segment, mirroring the released dataset.
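For orientation, a finished segment roughly looks like the sketch below; the judgement field names are illustrative, so consult the dataset card for the exact schema.

```python
# Illustrative segment from *_corrected_judged.json; the judgement
# field names are assumptions, not the released schema.
segment = {
    "text": "raw weakly labeled transcript segment ...",
    "corrected_text": "LLM-corrected segment ...",
    "judgement": {"model": "gpt-4o-mini", "score": 5},
}
```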
For small experiments you can bypass batches by omitting `--batch`:

```bash
uv run correct_transcriptions.py \
    --file data/spc_r/.../20180606_03_test.json \
    --pdf data/spc_r/.../tagblatt.pdf \
    --correction_model gpt-4o --judge_model gpt-4o-mini --temperature 0.1
```

The script sequentially calls the Chat Completions API for each segment, first for corrections, then for judgements, and writes `<file>_corrected.json` once finished.
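Per segment this corresponds to a plain Chat Completions call; a minimal sketch (assumed; the real prompt construction is in `utils.py`):

```python
from openai import OpenAI

client = OpenAI()

# Sketch of the synchronous correction call (assumed prompt wording).
def correct_segment(context_chunks: str, segment_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.1,
        messages=[
            {"role": "system", "content": "Correct the transcript segment using the context."},
            {"role": "user", "content": f"{context_chunks}\n\nSegment: {segment_text}"},
        ],
    )
    return response.choices[0].message.content
```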
- JSON transcripts and their derived files stay together. The batch scripts derive filenames automatically:
  - `*.json` → correction input
  - `*_batch.jsonl` → correction batch requests
  - `*_batch_answer.jsonl` → correction batch responses
  - `*_corrected.json` → transcript augmented with `corrected_text`
  - `*_corrected_batch.jsonl` → judgement batch requests
  - `*_corrected_batch_answer.jsonl` → judgement batch responses
  - `*_corrected_judged.json` → final output with scores
- Embeddings are cached in `embeddings/` (created on demand). You can delete the folder at any time to rebuild embeddings when the summary text changes; a sketch of the caching pattern follows below.
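A minimal sketch of such a cache, assuming OpenAI embeddings and a content-hash key (the embedding model name is an assumption; `utils.py` is authoritative):

```python
import hashlib
import json
from pathlib import Path
from openai import OpenAI

client = OpenAI()

# Hedged sketch of an embedding cache keyed by the chunk contents, so a
# changed summary text automatically produces a fresh cache entry.
def embed_chunks(chunks: list[str], cache_dir: Path = Path("embeddings")) -> list[list[float]]:
    cache_dir.mkdir(exist_ok=True)
    key = hashlib.sha256("\n".join(chunks).encode("utf-8")).hexdigest()
    cache_file = cache_dir / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    # Model name is an assumption, not necessarily what utils.py uses.
    response = client.embeddings.create(model="text-embedding-3-small", input=chunks)
    vectors = [item.embedding for item in response.data]
    cache_file.write_text(json.dumps(vectors))
    return vectors
```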
- Missing dependencies: re-run `uv sync`. uv keeps the lock file up to date whenever you change dependencies.
- Rate limits / retries: the batch helpers automatically back off with exponential waits when OpenAI returns rate-limit errors.
- Incorrect file names: ensure the patterns follow the dataset convention. `custom_id` is constructed as `<basename>.json__<segment_id>__<step>` so the merge utilities can map answers back to segments.
- PDF OCR quirks: if the toolkit saves `tagblatt.txt` next to the PDF, you can edit it manually; the next run reuses the cached text.
The code is released under the terms of the repository's LICENSE. Dataset licensing and responsible-use statements are documented on the Hugging Face dataset card.