DeepFabric is a powerful synthetic dataset generation framework that leverages LLMs to create high-quality, diverse training data at scale. Built for ML engineers, researchers, and AI developers, it streamlines the entire dataset creation pipeline from topic generation to model-ready formats.
No more unruly models that fail at tool calling, and no more reams of natural-language prompting to coax out structured formats. DeepFabric ensures your training data is consistent, well-structured, and ready for fine-tuning or evaluation.
- Hierarchical Topic Generation: Tree and graph-based architectures for comprehensive domain coverage
- Multi-Format Export: Direct export to popular training formats (no conversion scripts needed)
- Conversation Templates: Support for various dialogue patterns and reasoning styles
- Tool Calling Support: Generate function-calling and agent interaction datasets
- Structured Output: Pydantic & Outlines enforced schemas for consistent, high-quality data
- Multi-Provider Support: Works with OpenAI, Anthropic, Google, Ollama, and more
- HuggingFace Integration: Direct dataset upload with auto-generated cards
| Format | Template | Use Case | Framework Compatibility |
|---|---|---|---|
| Alpaca | `builtin://alpaca.py` | Instruction-following | Stanford Alpaca, LLaMA |
| ChatML | `builtin://chatml.py` | Multi-turn conversations | Most chat models |
| Unsloth | `builtin://unsloth.py` | Optimized fine-tuning | Unsloth notebooks |
| GRPO | `builtin://grpo.py` | Mathematical reasoning | GRPO training |
| Im Format | `builtin://im_format.py` | Chat with delimiters | ChatML-compatible models |
| Tool Calling | `builtin://tool_calling.py` | Function calling | Agent training |
| Single Tool Call | `builtin://single_tool_call.py` | Individual tool calls | Single function execution |
| Harmony | `builtin://harmony.py` | Reasoning with tags | gpt-oss |
| Custom | `file://your_format.py` | Your requirements | Any framework |
You can create your own custom output format by implementing a simple Python class with a `format` method, using the deepfabric library and its `BaseFormatter` class. See the Custom Format Guide for details.
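A minimal sketch of what such a class can look like. A stand-in `BaseFormatter` is defined here so the example runs standalone; the real base class and the exact method signature come from the deepfabric library, so treat the details below as an assumption and consult the Custom Format Guide for the authoritative interface.

```python
class BaseFormatter:
    """Stand-in for deepfabric's BaseFormatter (assumed interface)."""
    pass


class AlpacaLikeFormatter(BaseFormatter):
    """Map a chat-style sample to an instruction/output record."""

    def format(self, sample: dict) -> dict:
        messages = sample["messages"]
        # Pull the first user and assistant turns out of the messages list.
        user = next(m["content"] for m in messages if m["role"] == "user")
        assistant = next(m["content"] for m in messages if m["role"] == "assistant")
        return {"instruction": user, "input": "", "output": assistant}
```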
Template Type | Description | Example Use Case |
---|---|---|
Single-Turn | Question → Answer | FAQ, classification |
Multi-Turn | Extended dialogues | Chatbots, tutoring |
Chain of Thought (CoT) | Step-by-step reasoning | Math, logic problems |
Structured CoT | Explicit reasoning traces | Educational content |
Hybrid CoT | Mixed reasoning styles | Complex problem-solving |
Tool Calling | Function invocations | Agent interactions |
System-Prompted | With system instructions | Role-playing, personas |
If there's a format or feature you'd like to see, please open an issue.
DeepFabric is designed to work within a modular MLOps pipeline, allowing you to customize each stage of the dataset generation process. The main components are:
- Topic Generation: Create a structured topic tree or graph based on a high-level prompt.
- Data Generation: Generate training examples for each topic using LLMs.
- Format Engine: Convert raw outputs into your desired dataset format.
```mermaid
graph LR
    A[Topic Prompt] --> B[Topic Tree/Graph]
    B --> C[Data Generator]
    C --> D[Format Engine]
    D --> E[Export/Upload]
```
By decoupling these components, you can easily swap out models, prompts, and formats to suit your specific needs, along with version controlling your configurations for reproducibility.
```shell
pip install deepfabric
```

```shell
# Set your API key (or use Ollama for local generation)
export OPENAI_API_KEY="your-api-key"

# Generate a dataset with a single command
deepfabric generate \
  --mode tree \
  --provider openai \
  --model gpt-4o \
  --depth 3 \
  --degree 3 \
  --num-steps 9 \
  --batch-size 1 \
  --topic-prompt "The history of quantum physics" \
  --generation-system-prompt "You are an expert on academic history, with a specialism in the sciences" \
  --dataset-save-as dataset.jsonl
```
DeepFabric will automatically:
- Generate a hierarchical topic tree (3 levels deep, 3 branches per level)
- Create 9 diverse Q&A pairs across the generated topics
- Save your dataset to `dataset.jsonl`
Your dataset is ready in the OpenAI standard instruct format (JSONL):

```json
{
  "messages": [
    {
      "role": "user",
      "content": "Can you explain Albert Einstein's contribution to quantum theory?"
    },
    {
      "role": "assistant",
      "content": "Albert Einstein made significant contributions to quantum theory, particularly through his explanation of the photoelectric effect, for which he won the Nobel Prize in 1921. He proposed that light could be thought of as discrete packets of energy called quanta or photons, which could explain how electrons are emitted from metals when exposed to light. This idea was instrumental in the development of quantum mechanics. He later became famous for his skepticism about quantum mechanics' probabilistic interpretation, leading to his quote \"God does not play dice with the universe.\""
    }
  ]
}
```
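You can sanity-check a generated file with a few lines of standard-library Python; each line of the JSONL file is one JSON object with a `messages` list. The helper below writes a tiny two-record file in that layout and reads it back (the sample contents are illustrative, not DeepFabric output):

```python
import json
import tempfile


def load_jsonl(path):
    """Read one JSON object per line (the JSONL layout shown above)."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


# Write a tiny two-record file matching the messages format, then read it back.
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    for q, a in [("What is a photon?", "A quantum of light."),
                 ("Who proposed energy quanta?", "Max Planck, in 1900.")]:
        f.write(json.dumps({"messages": [
            {"role": "user", "content": q},
            {"role": "assistant", "content": a}]}) + "\n")
    path = f.name

records = load_jsonl(path)
```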
Generate larger datasets with different models:

```shell
# With a depth of 4 and degree of 5, the tree has 5^4 = 625 leaf topics
deepfabric generate \
  --provider ollama \
  --model qwen3:32b \
  --depth 4 \
  --degree 5 \
  --num-steps 100 \
  --batch-size 5 \
  --topic-prompt "Machine Learning Fundamentals" \
  --generation-system-prompt "You are an expert on Machine Learning and its application in modern technologies" \
  --dataset-save-as dataset.jsonl
```
There are lots more examples to get you going.
Mode | Structure | Use Case | Max Topics |
---|---|---|---|
Tree | Hierarchical branching | Well-organized domains | degree^depth |
Graph | DAG with cross-connections | Interconnected concepts | Flexible |
Linear | Sequential topics | Simple lists | User-defined |
Custom | User-provided structure | Specific requirements | Unlimited |
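For tree mode, topic counts can be estimated up front. Assuming each node branches into `degree` subtopics down to `depth` levels (not counting the root prompt itself), the leaf count is `degree ** depth` and the total node count sums every level:

```python
def tree_topic_counts(depth: int, degree: int) -> tuple[int, int]:
    """Return (leaf_topics, total_topics) for a full tree of the given shape."""
    leaves = degree ** depth
    # Sum every level from 1..depth: degree + degree^2 + ... + degree^depth
    total = sum(degree ** level for level in range(1, depth + 1))
    return leaves, total
```

For example, the quick-start settings (`depth=3, degree=3`) give 27 leaf topics, while the larger run (`depth=4, degree=5`) gives 625.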
Provider | Models | Best For | Local/Cloud |
---|---|---|---|
OpenAI | GPT-4, GPT-4o, GPT-3.5 | High quality, complex tasks | Cloud |
Anthropic | Claude 3.5 Sonnet, Haiku | Nuanced reasoning | Cloud |
Google | Gemini 2.0, 1.5 | Cost-effective at scale | Cloud |
Ollama | Llama, Mistral, Qwen, etc. | Privacy, unlimited generation | Local |
Together | Open models | Fast inference | Cloud |
Groq | Llama, Mixtral | Ultra-fast generation | Cloud |
DeepFabric uses a flexible YAML-based configuration with extensive CLI overrides:

```yaml
# Main system prompt - used as fallback throughout the pipeline
dataset_system_prompt: "You are a helpful AI assistant providing clear, educational responses."

# Topic Tree Configuration
# Generates a hierarchical topic structure using tree generation
topic_tree:
  topic_prompt: "Python programming fundamentals and best practices"

  # LLM Settings
  provider: "ollama"    # Options: openai, anthropic, gemini, ollama
  model: "qwen3:0.6b"   # Change to your preferred model
  temperature: 0.7      # 0.0 = deterministic, 1.0 = creative

  # Tree Structure
  degree: 2             # Number of subtopics per node (1-10)
  depth: 2              # Depth of the tree (1-5)

  # Topic generation prompt (optional - uses dataset_system_prompt if not specified)
  topic_system_prompt: "You are a curriculum designer creating comprehensive programming learning paths. Focus on practical concepts that beginners need to master."

  # Output
  save_as: "python_topics_tree.jsonl"  # Where to save the generated topic tree

# Data Engine Configuration
# Generates the actual training examples
data_engine:
  instructions: "Create clear programming tutorials with working code examples and explanations"

  # LLM Settings (can override main provider/model)
  provider: "ollama"
  model: "qwen3:0.6b"
  temperature: 0.3      # Lower temperature for more consistent code
  max_retries: 3        # Number of retries for failed generations

  # Content generation prompt
  generation_system_prompt: "You are a Python programming instructor creating educational content. Provide working code examples, clear explanations, and practical applications."

# Dataset Assembly Configuration
# Controls how the final dataset is created and formatted
dataset:
  creation:
    num_steps: 4        # Number of training examples to generate
    batch_size: 1       # Number of examples to process at a time
    sys_msg: true       # Include system messages in output format

  # Output
  save_as: "python_programming_dataset.jsonl"

# Optional Hugging Face Hub configuration
huggingface:
  # Repository in format "username/dataset-name"
  repository: "your-username/your-dataset-name"
  # Token can also be provided via HF_TOKEN environment variable or --hf-token CLI option
  token: "your-hf-token"
  # Additional tags for the dataset (optional)
  # "deepfabric" and "synthetic" tags are added automatically
  tags:
    - "deepfabric-generated-dataset"
    - "geography"
```
Run using the CLI:

```shell
deepfabric generate config.yaml
```
The CLI supports various options to override configuration values:

```shell
# --sys-msg controls system message inclusion (default: true)
deepfabric generate config.yaml \
  --save-tree output_tree.jsonl \
  --dataset-save-as output_dataset.jsonl \
  --model-name ollama/qwen3:8b \
  --temperature 0.8 \
  --degree 4 \
  --depth 3 \
  --num-steps 10 \
  --batch-size 2 \
  --sys-msg true \
  --hf-repo username/dataset-name \
  --hf-token your-token \
  --hf-tags tag1 --hf-tags tag2
```
CoT Style | Template Pattern | Best For |
---|---|---|
Free-text | Natural language steps | Mathematical problems (GSM8K-style) |
Structured | Explicit reasoning traces | Educational content, tutoring |
Hybrid | Mixed reasoning | Complex multi-step problems |
```yaml
# Example: Structured CoT configuration
data_engine:
  conversation_template: "cot_structured"
  cot_style: "mathematical"
  include_reasoning_tags: true
```
- Deduplication: Automatic removal of similar samples
- Validation: Schema enforcement for all outputs
- Retry Logic: Automatic retry with backoff for failures
- Error Tracking: Detailed logs of generation issues
- Progress Monitoring: Real-time generation statistics
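To illustrate the deduplication idea, here is a minimal sketch that drops exact duplicates by hashing each record's canonical JSON form. This is an illustration only; DeepFabric's own deduplication of "similar" samples may use a more sophisticated similarity measure than exact matching.

```python
import hashlib
import json


def deduplicate(records):
    """Keep the first occurrence of each record, dropping exact duplicates."""
    seen, unique = set(), []
    for rec in records:
        # sort_keys makes the hash stable regardless of key order
        key = hashlib.sha256(
            json.dumps(rec, sort_keys=True).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```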
Resource | Description | Link |
---|---|---|
Documentation | Complete API reference & guides | docs |
Examples | Ready-to-use configurations | examples/ |
Discord | Community support | Join Discord |
Issues | Bug reports & features | GitHub Issues |
DeepFabric development is moving at a fast pace. For a great way to follow the project and be instantly notified of new releases, star the repo.
We welcome contributions! Check out our good first issues to get started.
git clone https://github.com/lukehinds/deepfabric
cd deepfabric
uv sync --all-extras # Install with dev dependencies
make test # Run tests
make format # Format code
- Discord: Join our community for real-time help
- Issues: Report bugs or request features
- Discussions: Share your use cases and datasets
If you're using DeepFabric in production or research, we'd love to hear from you! Share your experience in our Discord or open a discussion.
Use Case | Description | Example Config |
---|---|---|
Model Distillation | Teacher-student training | distillation.yaml |
Evaluation Benchmarks | Model testing datasets | benchmark.yaml |
Domain Adaptation | Specialized knowledge | domain.yaml |
Agent Training | Tool-use & reasoning | agent.yaml |
Instruction Tuning | Task-specific models | instruct.yaml |
Math Reasoning | Step-by-step solutions | math.yaml |
- Start Small: Test with `depth=2, degree=3` before scaling up
- Mix Models: Use stronger models for topics, faster ones for generation
- Iterate: Generate small batches and refine prompts based on results
- Validate: Always review a sample before training
- Version Control: Save configurations for reproducibility
We use fully anonymised analytics to help us improve application performance and stability. We never send personally identifiable information, and we do not capture prompts, generated content, API keys, file names, etc.
We capture model names, numeric parameters (temperature, depth, degree, batch_size), timing, and success/failure rates; this helps us find optimizations and bottlenecks.
You can fully disable all analytics by setting the environment variable `ANONYMIZED_TELEMETRY=False`.
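For example, to opt out for the current shell session:

```shell
export ANONYMIZED_TELEMETRY=False
```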