Granary Dataset Processing (Component-Based) #135

Open. ssh-meister wants to merge 1 commit into main.

ssh-meister (Collaborator) commented Jul 2, 2025

This pull request introduces a set of modular components for processing the Granary dataset.

🔧 General-Purpose Processors:

These processors are not specific to any single dataset and can be reused across different data pipelines:

  1. LambdaExpression processor (#136, "LambdaExpression processor implementation")
  2. SubRegex processor: adds support for reading a list of regex parameters from a YAML file (#137, "SubRegex processor: substitution rules from an external YAML")
  3. ExtractTar and RemoveFiles processors (#139, "Add RemoveFiles and ExtractTar, reorganize audio converters")
  4. FasterWhisperInference, DetectWhisperHallucinationFeatures, vLLMInference, and CleanQwenGeneration processors (#141, "Refactor inference processes & add new engines (FasterWhisper, vLLM)")
  5. ListToEntries processor (#140, "ListToEntries processor")
  6. DropSpecifiedFields processor (#144, "DropSpecifiedFields processor implementation")
  7. CharacterHistogramLangValidator processor (#154, "CharacterHistogramLangValidator processor implementation")
  8. FastTextLangIdClassifier processor (#149, "FastTextLangIdClassifier processor implementation")
  9. CometoidWMTQualityEstimation processor (#151, "CometoidWMTQualityEstimation processor implementation")
  10. ConvertToTarredAudioDataset processor (#145, "ConvertToTarredAudioDataset processor implementation")
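
For reviewers unfamiliar with the idea behind item 2, here is a minimal sketch of applying an ordered list of regex substitution rules to a manifest entry. It is not the code from #137: the rule keys (`pattern`, `repl`), the `text` field, and the helper names `apply_rules` and `process_entry` are all assumptions made for illustration, and the rules are written inline as a Python list rather than loaded from a YAML file.

```python
import re

# Hypothetical rules, shaped like entries that could be read from a YAML file.
# Order matters: bracketed tags are removed first, then whitespace is collapsed.
RULES = [
    {"pattern": r"[\[\(].*?[\]\)]", "repl": ""},
    {"pattern": r"\s+", "repl": " "},
]


def apply_rules(text, rules):
    """Apply each substitution rule in order and trim the result."""
    for rule in rules:
        text = re.sub(rule["pattern"], rule["repl"], text)
    return text.strip()


def process_entry(entry, rules, text_key="text"):
    """Return a copy of a manifest entry with the text field cleaned."""
    updated = dict(entry)
    updated[text_key] = apply_rules(entry[text_key], rules)
    return updated
```

Keeping the rules as plain data, separate from the processing function, is what makes it practical to move them into an external file and reuse them across pipelines.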

⛓️ Pipelines

  1. Unified pipeline and README with instructions and documentation (#155, "Granary large-scale speech processing pipeline")

ssh-meister self-assigned this Jul 23, 2025