We introduce SDAR (Synergy of Diffusion and AutoRegression), a large-scale diffusion language model that unites the complementary strengths of autoregressive and discrete diffusion modeling. By merging the training efficiency of autoregressive methods with the highly parallel decoding ability of diffusion models, SDAR delivers performance competitive with state-of-the-art open-source AR models. It sets a new standard as the most powerful diffusion-based language model to date—particularly excelling as a generalist model with strong specialist capabilities.
Highlights:
- 🚀 Low-Cost AR-to-BlockDiffusion
- ⚡ 2-4× Faster Inference
- 🧠 Advanced performance on science reasoning benchmarks (e.g., GPQA and ChemBench)
SDAR is still at an early, experimental stage; we are actively developing it further and warmly welcome collaborations in this direction.
- [2025-10-29] We have open-sourced our downstream task fine-tuning framework, powered by LlamaFactory. It provides a powerful and user-friendly toolkit for adapting SDAR to your specific needs 🛠️.
- [2025-10-10] We've implemented an industrial-grade inference solution for SDAR models on the lmdeploy framework, providing robust and efficient deployment infrastructure for production environments 🚀.
- [2025-09-09] We’ve open-sourced the weights for models with various block sizes. Alongside our default model (block size = 4), you can now find models with block sizes of 8, 16, 32, and 64 on Hugging Face 🤗.
- [2025-08-18] We’ve open-sourced the weights for our SDAR-30B-A3B-Sci model — now available on Hugging Face 🤗.
- [2025-08-13] We’ve released the inference code for SDAR models, including a built-in script and a third-party inference engine JetEngine 🚀.
- [2025-07-20] We’ve open-sourced the weights for our 1.7B, 4B, 8B dense models, along with our 30B MoE model — now available on Hugging Face 🤗.
- SDAR: A Synergistic Diffusion–AutoRegression Paradigm for Scalable Sequence Generation
For detailed instructions on how to fine-tune the model on your own dataset, please refer to the guide in the training directory: training/README.md.
transformers>=4.52.4
python generate.py \
--model_dir=JetLM/SDAR-1.7B-Chat \
--trust_remote_code
2. Using the prepared inference engine JetEngine (for batch inference and production-level speedup)
JetEngine, a lightweight inference engine for the SDAR series built on nano-vllm, supports both dense and MoE models as well as tensor-parallel distributed inference, and delivers substantial acceleration over the naive implementation.
In our benchmark, we tested the 4B SDAR model with block size 4 (basic acceleration setting) and batch size 128:
- On NVIDIA A800, JetEngine reached 1800+ tokens/second.
- On NVIDIA H200, JetEngine achieved 3700+ tokens/second using FlashAttention-2 + Triton kernels.
This demonstrates that JetEngine can unlock production-level throughput for SDAR models, making it ideal for both research-scale batch inference and real-world deployment scenarios.
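If you want a rough reproduction of this measurement on your own hardware, the sketch below times a batch of generations and counts output tokens. It is a minimal sketch, not a benchmark harness: it reuses the `LLM` / `SamplingParams` example shown later in this section (install JetEngine first, see below), and reading the generated text back via `out.text` is an assumption about the output objects that may need adjusting.

```python
# Rough throughput check for an SDAR model served by JetEngine (illustrative sketch).
import os
import time

from jetengine import LLM, SamplingParams
from transformers import AutoTokenizer

model_path = os.path.expanduser("/path/to/your/model")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

llm = LLM(model_path, enforce_eager=True, tensor_parallel_size=1,
          mask_token_id=151669, block_length=4)

sampling_params = SamplingParams(
    temperature=1.0, topk=0, topp=1.0, max_tokens=256,
    remasking_strategy="low_confidence_dynamic",
    block_length=4, denoising_steps=4, dynamic_threshold=0.9,
)

# Build a batch of identical chat prompts; batch size is the main throughput lever.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain block diffusion decoding in two sentences."}],
    tokenize=False, add_generation_prompt=True,
)
prompts = [prompt] * 128

start = time.perf_counter()
outputs = llm.generate_streaming(prompts, sampling_params)
elapsed = time.perf_counter() - start

# Count generated tokens with the tokenizer (assumes each output exposes `.text`).
n_tokens = sum(len(tokenizer.encode(o.text, add_special_tokens=False)) for o in outputs)
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.0f} tokens/s")
```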
pip install flash-attn --no-build-isolation  # install FlashAttention-2
git clone https://github.com/JetAstra/SDAR.git
cd SDAR
git submodule update --init --recursive
cd third_party/JetEngine
pip install .
The following example shows how to quickly load a model with JetEngine and run a prompt end-to-end.
import os
from jetengine import LLM, SamplingParams
from transformers import AutoTokenizer
model_path = os.path.expanduser("/path/to/your/model")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Initialize the LLM
llm = LLM(
    model_path,
    enforce_eager=True,
    tensor_parallel_size=1,
    mask_token_id=151669,  # Optional: only needed for masked/diffusion models
    block_length=4
)
# Set sampling/generation parameters
sampling_params = SamplingParams(
    temperature=1.0,
    topk=0,
    topp=1.0,
    max_tokens=256,
    remasking_strategy="low_confidence_dynamic",
    block_length=4,
    denoising_steps=4,
    dynamic_threshold=0.9
)
# Prepare a simple chat-style prompt
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain what reinforcement learning is in simple terms."}],
    tokenize=False,
    add_generation_prompt=True
)
# Generate text
outputs = llm.generate_streaming([prompt], sampling_params)
3. Using the prepared inference engine LMDeploy (for batch inference and production-level speedup)
from lmdeploy import pipeline, PytorchEngineConfig, GenerationConfig, ChatTemplateConfig
from lmdeploy.pytorch.tools.utils import Timer, visualize_pipe_out
if __name__ == '__main__':
    model_path = 'JetLM/SDAR-8B-Chat'
    prompts = [
        [dict(role="user", content="Given the function $f(x) = \\frac{4x^2 - 4x + 4}{x^2 + 2x + 4}$, where $x \\in \\mathbb{R}$, determine its minimum value.\nPlease reason step by step, and put your final answer within \\boxed{}.\n")],
        [dict(role="user", content="If the domain of the function $\\log x^2$ is $x < a$ or $x > b$, for some $a$ and $b$, find $a + b$.\nPlease reason step by step, and put your final answer within \\boxed{}.\n")],
        [dict(role="user", content="Find the sum of all integer bases $b>9$ for which $17_{b}$ is a divisor of $97_{b}$.\nRemember to put your final answer within \\boxed{}.\n")],
        [dict(role="user", content="Find the number of ordered pairs $(x,y)$, where both $x$ and $y$ are integers between $-100$ and $100$, inclusive, such that $12x^{2}-xy-6y^{2}=0$.\nRemember to put your final answer within \\boxed{}.\n")],
    ]
    backend_config = PytorchEngineConfig(
        tp=1,
        dtype="float16",
        max_prefill_token_num=4096,
        cache_max_entry_count=0.8,
        dllm_block_length=4,
        dllm_denoising_steps=4,
        dllm_unmasking_strategy="low_confidence_dynamic",
        dllm_confidence_threshold=0.9,
    )
    pipe = pipeline(model_path, backend_config=backend_config)
    gen_config = GenerationConfig(
        top_p=0.95,
        top_k=50,
        temperature=1.0,
        do_sample=False,  # greedy decoding
        max_new_tokens=4096,
    )
    outputs = pipe(prompts, gen_config=gen_config)
    for output in outputs:
        print(output.text)
We start from Qwen3-1.7B-Base, Qwen3-4B-Base, Qwen3-8B-Base, and Qwen3-30B-A3B-Base.
Each model is continually pretrained on 50B tokens (~0.14% of the original pretraining token budget) of relatively low-quality open-source data, followed by supervised fine-tuning on 4B tokens.
The default model maintains a block size of 4 throughout its entire training process. For block size scaling, we use a block size of 4 during the continued pretraining phase, and directly increase it to the target block size (e.g., 8, 16, 32, or 64) during the SFT phase.
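For intuition, the snippet below is a conceptual sketch of the block-wise masking used in block-diffusion training, not the project's actual training code: each sequence is split into fixed-size blocks, every block receives an independently sampled masking ratio, and the loss is computed only on the masked positions. The mask token id (151669, taken from the inference example above) and the helper name are illustrative assumptions.

```python
import torch

def block_mask(input_ids: torch.Tensor, block_size: int = 4,
               mask_token_id: int = 151669):
    """Conceptual sketch of block-wise masking for block-diffusion training.

    Splits each sequence into blocks of `block_size` tokens, samples an
    independent masking ratio per block, and replaces the selected positions
    with the mask token. Returns the corrupted ids and the boolean mask of
    positions that contribute to the loss.
    """
    batch, seq_len = input_ids.shape
    assert seq_len % block_size == 0, "sequence length must be a multiple of the block size"
    n_blocks = seq_len // block_size

    # One masking ratio per block, drawn uniformly from [0, 1).
    ratios = torch.rand(batch, n_blocks, 1)
    # Mask a position when its per-token draw falls below the block's ratio.
    mask = (torch.rand(batch, n_blocks, block_size) < ratios).view(batch, seq_len)

    noisy_ids = input_ids.masked_fill(mask, mask_token_id)
    return noisy_ids, mask

# Toy usage: a batch of two length-8 sequences with block size 4.
ids = torch.randint(0, 1000, (2, 8))
noisy_ids, loss_mask = block_mask(ids)
```

Roughly speaking, blocks are then processed left to right: a token attends to earlier (clean) blocks causally and to its own (noised) block bidirectionally, which is what allows cached, block-by-block generation at inference time.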
- SDAR training: SDAR-1.7B-Chat / SDAR-4B-Chat / SDAR-8B-Chat / SDAR-30B-A3B-Chat.
- AR training: Qwen3-1.7B-AR-SFT / Qwen3-30B-AR-SFT.
- Decoding
  - SDAR family: greedy decoding with `block_length=4`, `denoising_steps=4`.
  - AR baselines: greedy decoding.
- Base model sources
  - Qwen3-1.7B-Base / Qwen3-30B-Base are taken from the Qwen3 Technical Report.
Table 1. Overall performance across general benchmarks.

Note
- SDAR-1.7B-Chat is on par with Qwen3-1.7B-AR-SFT across most benchmarks.
- SDAR-30B-A3B-Chat performs comparably to Qwen3-30B-AR-SFT.
We compare SDAR-30B-A3B-Chat and Qwen3-30B-AR-SFT under static and dynamic decoding:
- Static: each decoding step emits a fixed number of tokens, independent of confidence.
- Dynamic: within a block, once a token's confidence exceeds a threshold $\theta$, the decoder emits multiple tokens at once (up to the block size). A minimal sketch of both strategies is given below.
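To make the two schedules concrete, here is a minimal, self-contained sketch of how a single block might be unmasked under each strategy. It operates on toy per-position confidence scores rather than a real denoiser, and the function and parameter names are illustrative assumptions, not SDAR's actual implementation.

```python
import torch

def unmask_block(confidence: torch.Tensor, mode: str = "dynamic",
                 denoising_steps: int = 4, threshold: float = 0.9):
    """Toy unmasking schedule for one block of positions.

    `confidence` holds a per-position confidence score (e.g., the max softmax
    probability from the denoiser). Returns the positions revealed at each step.
    A real decoder recomputes confidences after every step; here a fixed tensor
    is reused purely to illustrate the selection rule.
    """
    block_length = confidence.numel()
    masked = set(range(block_length))
    schedule = []
    while masked:
        remaining = sorted(masked, key=lambda i: -confidence[i].item())
        if mode == "static":
            # Static: always reveal a fixed number of tokens per step.
            per_step = max(1, block_length // denoising_steps)
            chosen = remaining[:per_step]
        else:
            # Dynamic: reveal every masked position whose confidence exceeds the
            # threshold, but always at least the single most confident one.
            chosen = [i for i in remaining if confidence[i] > threshold] or remaining[:1]
        schedule.append(chosen)
        masked.difference_update(chosen)
    return schedule

conf = torch.tensor([0.99, 0.95, 0.60, 0.97])  # toy per-position confidences
print(unmask_block(conf, mode="static"))   # one token per step: 4 steps
print(unmask_block(conf, mode="dynamic"))  # three confident tokens at once, then the rest
```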
Figure 1. Accuracy–speedup under static vs. dynamic inference; dynamic threshold sweeps relative to static.
Note
- SDAR delivers >2× speedup over static inference with negligible accuracy loss; its static decoding speed is already comparable to that of AR models.
- The speedup scales with model size, making SDAR increasingly favorable for larger models.
We start from Qwen3-30B-A3B-Base and derive two science-oriented bases via large-scale pretraining and annealing, followed by reasoning SFT:
- 500B tokens (continual pretraining) + 500B tokens (annealing) → AR-30B-A3B-Sci-Base
- From the annealing corpus, sample 50B tokens and continue training with SDAR → SDAR-30B-A3B-Sci-Base
- Fine-tune both bases on reasoning datasets → AR-30B-A3B-Sci and SDAR-30B-A3B-Sci
- Decoding & inference
  - AR: sampling decoding with `temperature=0.6`, `top_p=0.95`, `top_k=20`.
  - SDAR: `block_length=4`, `denoising_steps=4`; we report both (G) greedy and (S) sampling (`temperature=1.0`, `top_p=1.0`, `top_k=0`) decoding strategies.
- Reporting protocol
  - Averages over 8 runs for GPQA and 32 runs for AIME 2024, AIME 2025, and LMB-hard.
Abbreviations: LMB = LiveMathBench, LCB = LiveCodeBench, (S) = sampling, (G) = greedy.
Table 2. Strict comparison under identical backbones and datasets. Benchmarks on general reasoning, mathematics and code generation.
Table 3. Strict comparison under identical backbones and datasets. Benchmarks on scientific domains.
Note
SDAR-30B-A3B-Sci consistently outperforms AR-30B-A3B-Sci, with pronounced gains on science-focused tasks such as GPQA and ChemBench.
We position SDAR-30B-A3B-Sci against leading open- and closed-source LLMs. External scores are taken from InternLM/Intern-S1.
Table 4. Positioning against external models (sources: InternLM/Intern-S1).

| Model | Type | Link |
|---|---|---|
| SDAR-1.7B-Chat | Chat | huggingface.co/JetLM/SDAR-1.7B-Chat |
| SDAR-4B-Chat | Chat | huggingface.co/JetLM/SDAR-4B-Chat |
| SDAR-8B-Chat | Chat | huggingface.co/JetLM/SDAR-8B-Chat |
| SDAR-30B-A3B-Chat | Chat | huggingface.co/JetLM/SDAR-30B-A3B-Chat |
| SDAR-30B-A3B-Sci | Thinking (Science) | huggingface.co/JetLM/SDAR-30B-A3B-Sci |
- Release SDAR Technical Report
- Release Inference Engine and Training Framework
- More features are work in progress
- Shuang Cheng: Initial idea proposal, model evaluation, and inference.
- Yihan Bian: Engineering optimization, inference & training acceleration, MoE training code implementation.
- Dawei Liu: Implementation of model training code, training experiments.
- Biqing Qi: Project Leader and overall coordination.
Note
This project is a collaborative effort, with all contributors solving challenges together.
For the full list of contributors, please refer to the author list in the citation. We are also deeply grateful to everyone who engaged in discussions and provided valuable feedback throughout the development of this project.
We would like to express our gratitude to the following works (MDLM, LLaDA, DiffuLLaMA, Block Diffusion) for providing important theoretical foundations and inspiration for SDAR.
For issues or inquiries:
- Shuang Cheng, Shanghai AI Lab ([email protected])
- Biqing Qi (Corresponding Author), Shanghai AI Lab ([email protected])
@misc{JetAstra2025,
title={SDAR: A Synergistic Diffusion–AutoRegression Paradigm for Scalable Sequence Generation},
author={Shuang Cheng and Yihan Bian and Dawei Liu and Yuhua Jiang and Yihao Liu and Linfeng Zhang and Wenghai Wang and Qipeng Guo and Kai Chen and Biqing Qi* and Bowen Zhou},
year={2025},
institution={Shanghai AI Lab},
url={https://github.com/JetAstra/SDAR}
}


