SDAR (Synergy of Diffusion and AutoRegression), a large diffusion language model (1.7B, 4B, 8B, 30B)

License: MIT | Website: SDAR | HuggingFace: Models | Technical Report: Arxiv

We introduce SDAR (Synergy of Diffusion and AutoRegression), a large-scale diffusion language model that unites the complementary strengths of autoregressive and discrete diffusion modeling. By merging the training efficiency of autoregressive methods with the highly parallel decoding ability of diffusion models, SDAR delivers performance competitive with state-of-the-art open-source AR models. It sets a new standard as the most powerful diffusion-based language model to date—particularly excelling as a generalist model with strong specialist capabilities.

Highlights:

  • 🚀 Low-Cost AR-to-BlockDiffusion
  • ⚡ 2-4× Faster Inference
  • 🧠 Advanced performance on science reasoning benchmarks (e.g., GPQA and ChemBench)

SDAR is still at an early, experimental stage. We are actively developing it into a more systematic framework and warmly welcome collaborations in this direction.

🔥 News

  • [2025-10-29] We have open-sourced our downstream task fine-tuning framework, powered by LlamaFactory. It provides a powerful and user-friendly toolkit for adapting SDAR to your specific needs 🛠️.
  • [2025-10-10] We've implemented an industrial-grade inference solution for SDAR models on the lmdeploy framework, providing robust and efficient deployment infrastructure for production environments 🚀.
  • [2025-09-09] We’ve open-sourced the weights for models with various block sizes. Alongside our default model (block size = 4), you can now find models with block sizes of 8, 16, 32, and 64 on Hugging Face 🤗.
  • [2025-08-18] We’ve open-sourced the weights for our SDAR-30B-A3B-Sci model — now available on Hugging Face 🤗.
  • [2025-08-13] We’ve released the inference code for SDAR models, including a built-in script and a third-party inference engine JetEngine 🚀.
  • [2025-07-20] We’ve open-sourced the weights for our 1.7B, 4B, 8B dense models, along with our 30B MoE model — now available on Hugging Face 🤗.

📑 Contents

⚙️ Usage

Training

For detailed instructions on how to fine-tune the model on your own dataset, please refer to the guide in the training directory: training/README.md.

Inference

transformers>=4.52.4

1. Using the built-in inference script

python generate.py \
  --model_dir=JetLM/SDAR-1.7B-Chat \
  --trust_remote_code

2. Using the prepared inference engine JetEngine (for batch inference and production-level speedup)

JetEngine, a lightweight inference engine for the SDAR series built on nano-vllm, supports both dense and MoE models as well as tensor-parallel distributed inference, and delivers substantial acceleration compared to the naive implementation.

In our benchmark, we tested the 4B SDAR model with block size 4 (basic acceleration setting) and batch size 128:

  • On NVIDIA A800, JetEngine reached 1800+ tokens/second.
  • On NVIDIA H200, JetEngine achieved 3700+ tokens/second using FlashAttention-2 + Triton kernels.

This demonstrates that JetEngine can unlock production-level throughput for SDAR models, making it ideal for both research-scale batch inference and real-world deployment scenarios.

pip install flash-attn --no-build-isolation  # Install FlashAttention-2
git clone https://github.com/JetAstra/SDAR.git
cd SDAR
git submodule update --init --recursive
cd third_party/JetEngine
pip install .

The following example shows how to quickly load a model with JetEngine and run a prompt end-to-end.

import os
from jetengine import LLM, SamplingParams
from transformers import AutoTokenizer

model_path = os.path.expanduser("/path/to/your/model")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Initialize the LLM
llm = LLM(
    model_path,
    enforce_eager=True,
    tensor_parallel_size=1,
    mask_token_id=151669,   # Optional: only needed for masked/diffusion models
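    # Block size used at inference; the released chat checkpoints default to block size 4.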
    block_length=4
)

# Set sampling/generation parameters
sampling_params = SamplingParams(
    temperature=1.0,
    topk=0,
    topp=1.0,
    max_tokens=256,
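    # Dynamic low-confidence remasking: within each block, tokens whose confidence
    # exceeds dynamic_threshold are committed in the same denoising step (see the
    # efficiency experiments below).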
    remasking_strategy="low_confidence_dynamic",
    block_length=4,
    denoising_steps=4,
    dynamic_threshold=0.9
)

# Prepare a simple chat-style prompt
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain what reinforcement learning is in simple terms."}],
    tokenize=False,
    add_generation_prompt=True
)

# Generate text
outputs = llm.generate_streaming([prompt], sampling_params)

3. Using the prepared inference engine LMDeploy (for batch inference and production-level speedup)

from lmdeploy import pipeline, PytorchEngineConfig, GenerationConfig, ChatTemplateConfig
from lmdeploy.pytorch.tools.utils import Timer, visualize_pipe_out


if __name__ == '__main__':
    model_path = 'JetLM/SDAR-8B-Chat'

    prompts = [
        [dict(role="user", content="Given the function $f(x) = \\frac{4x^2 - 4x + 4}{x^2 + 2x + 4}$, where $x \\in \\mathbb{R}$, determine its minimum value.\nPlease reason step by step, and put your final answer within \\boxed{}.\n")],
        [dict(role="user", content="If the domain of the function $\\log x^2$ is $x < a$ or $x > b$, for some $a$ and $b$, find $a + b$.\nPlease reason step by step, and put your final answer within \\boxed{}.\n")],
        [dict(role="user", content="Find the sum of all integer bases $b>9$ for which $17_{b}$ is a divisor of $97_{b}$.\nRemember to put your final answer within \\boxed{}.\n")],
        [dict(role="user", content="Find the number of ordered pairs $(x,y)$, where both $x$ and $y$ are integers between $-100$ and $100$, inclusive, such that $12x^{2}-xy-6y^{2}=0$.\nRemember to put your final answer within \\boxed{}.\n")],
    ]

    backend_config = PytorchEngineConfig(
            tp=1,
            dtype="float16",
            max_prefill_token_num=4096,
            cache_max_entry_count=0.8,
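            # dllm_* options configure SDAR's block-diffusion decoding: block size,
            # denoising steps per block, and the dynamic low-confidence unmasking
            # threshold (see the efficiency experiments below).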
            dllm_block_length=4,
            dllm_denoising_steps=4,
            dllm_unmasking_strategy="low_confidence_dynamic",
            dllm_confidence_threshold=0.9,
        )
    pipe = pipeline(model_path, backend_config=backend_config)

    gen_config = GenerationConfig(
        top_p=0.95,
        top_k=50,
        temperature=1.0,
        do_sample=False, # greedy decoding
        max_new_tokens=4096,
    )

    outputs = pipe(prompts, gen_config=gen_config)
    for output in outputs:
        print(output.text)

📊 Preliminary Experiments

Part I: Scaling the Qwen3 Series with SDAR for General (Non-Reasoning) Tasks

Training Setup

We start from Qwen3-1.7B-Base, Qwen3-4B-Base, Qwen3-8B-Base, and Qwen3-30B-A3B-Base.
Each model is continually pretrained on 50B tokens (~0.14%) of relatively low-quality open-source data, followed by supervised fine-tuning on 4B tokens.

The default model maintains a block size of 4 throughout its entire training process. For block size scaling, we use a block size of 4 during the continued pretraining phase, and directly increase it to the target block size (e.g., 8, 16, 32, or 64) during the SFT phase.
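
If you run inference with one of the larger-block-size checkpoints, the inference-time block_length should presumably match the block size the checkpoint was fine-tuned with. The following is a minimal JetEngine sketch, not an official recipe: the checkpoint path is a placeholder, and setting denoising_steps equal to block_length is an assumption.

from jetengine import LLM, SamplingParams

# Placeholder path: substitute the actual block-size-16 checkpoint from Hugging Face.
llm = LLM(
    "/path/to/SDAR-block16-checkpoint",
    enforce_eager=True,
    tensor_parallel_size=1,
    mask_token_id=151669,   # same mask token as in the JetEngine example above; verify against the model config
    block_length=16         # match the block size used during SFT
)

sampling_params = SamplingParams(
    temperature=1.0,
    topk=0,
    topp=1.0,
    max_tokens=256,
    remasking_strategy="low_confidence_dynamic",
    block_length=16,
    denoising_steps=16,     # assumption: one denoising step per token within the block
    dynamic_threshold=0.9
)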

  • SDAR training: SDAR-1.7B-Chat / SDAR-4B-Chat / SDAR-8B-Chat / SDAR-30B-A3B-Chat.
  • AR training: Qwen3-1.7B-AR-SFT / Qwen3-30B-AR-SFT.

Evaluation Setup

  • Decoding
    • SDAR family: greedy decoding with block_length = 4, denoising_steps = 4.
    • AR baselines: greedy decoding.
  • Base model sources

Experiments of Performance

Table 1. Overall performance across general benchmarks.

Note

  • SDAR-1.7B-Chat is on par with Qwen3-1.7B-AR-SFT across most benchmarks.
  • SDAR-30B-A3B-Chat performs comparably to Qwen3-30B-AR-SFT.

Experiments of Efficiency

We compare SDAR-30B-A3B-Chat and Qwen3-30B-AR-SFT under static and dynamic decoding:

  • Static: each decoding step emits a fixed number of tokens, independent of confidence.
  • Dynamic: within a block, once token confidences exceed a threshold $\theta$, the decoder commits multiple tokens at once (up to the block size), as sketched below.
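
A minimal, self-contained sketch of the dynamic rule (not the released decoder; the logits_fn interface and the "always commit at least one token" fallback are assumptions for illustration):

import torch

def decode_block_dynamic(logits_fn, block_len, theta=0.9):
    """Greedy dynamic decoding of a single block.

    logits_fn: assumed callable mapping the partially unmasked block (a LongTensor
    of shape (block_len,) with -1 marking masked positions) to logits of shape
    (block_len, vocab_size).
    """
    tokens = torch.full((block_len,), -1, dtype=torch.long)  # -1 = still masked
    for _ in range(block_len):                               # at most block_len denoising steps
        masked = tokens == -1
        if not masked.any():
            break
        probs = torch.softmax(logits_fn(tokens), dim=-1)
        conf, pred = probs.max(dim=-1)                       # per-position confidence and argmax token
        # Commit every masked position whose confidence exceeds theta (up to the whole
        # block); always commit at least the single most confident masked position so
        # each step makes progress.
        commit = masked & (conf >= theta)
        if not commit.any():
            best = conf.masked_fill(~masked, -1.0).argmax()
            commit[best] = True
        tokens[commit] = pred[commit]
    return tokens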

Figure 1. Accuracy–speedup under static vs. dynamic inference; dynamic threshold sweeps relative to static.

Note

  • SDAR delivers >2× speedup over static inference with negligible accuracy loss; its static speed is comparable to AR models.
  • The speedup scales with model size, making SDAR increasingly favorable for larger models.

Part II: Applying SDAR to Qwen3-30B-MoE for Reasoning Benchmarks

Training Setup

We start from Qwen3-30B-A3B-Base and derive two science-oriented bases via large-scale pretraining and annealing, followed by reasoning SFT:

  1. 500B tokens (continual pretraining) + 500B tokens (annealing) → AR-30B-A3B-Sci-Base
  2. From the annealing corpus, sample 50B tokens and continue training with SDAR → SDAR-30B-A3B-Sci-Base
  3. Fine-tune both bases on reasoning datasets → AR-30B-A3B-Sci and SDAR-30B-A3B-Sci

Evaluation Setup

  • Decoding & inference.
    • AR: sampling decoding with temperature=0.6, top_p=0.95, top_k=20.
    • SDAR: block_length=4, denoising_steps=4; we report both (G) greedy and (S) sampling (temperature=1.0, top_p=1.0, top_k=0) decoding strategies (see the configuration sketch after this list).
  • Reporting protocol.
    Averages over 8 runs for GPQA and 32 runs for AIME 2024, AIME 2025, and LMB-hard.
    Abbreviations: LMB = LiveMathBench, LCB = LiveCodeBench, (S) = sampling, (G) = greedy.
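
As a configuration sketch, the (S) sampling setting maps onto the JetEngine SamplingParams shown in the Inference section roughly as follows (max_tokens and the remasking fields are carried over from that example and may differ from the exact evaluation configuration; how to switch to the (G) greedy variant depends on the engine):

from jetengine import SamplingParams

# (S) sampling evaluation setting for SDAR: temperature=1.0, top_p=1.0, top_k=0,
# decoded with block_length=4 and denoising_steps=4.
sdar_sampling = SamplingParams(
    temperature=1.0,
    topk=0,
    topp=1.0,
    max_tokens=4096,        # assumption: generation length not specified in this section
    remasking_strategy="low_confidence_dynamic",   # carried over from the example above
    block_length=4,
    denoising_steps=4,
    dynamic_threshold=0.9                          # carried over from the example above
)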

Experiments of Performance

1. Strict Experimental Comparison

Table 2. Strict comparison under identical backbones and datasets. Benchmarks on general reasoning, mathematics and code generation.


Table 3. Strict comparison under identical backbones and datasets. Benchmarks on scientific domains.


Note

SDAR-30B-A3B-Sci consistently outperforms AR-30B-A3B-Sci, with pronounced gains on science-focused tasks such as GPQA and ChemBench.

2. Comparison to External Open/Closed Models

We position SDAR-30B-A3B-Sci against leading open- and closed-source LLMs. External scores are taken from InternLM/Intern-S1.

Table 4. Positioning against external models (sources: InternLM/Intern-S1).

🗂️ Model Zoo

| Model | Type | Link |
| --- | --- | --- |
| SDAR-1.7B-Chat | Chat | huggingface.co/JetLM/SDAR-1.7B-Chat |
| SDAR-4B-Chat | Chat | huggingface.co/JetLM/SDAR-4B-Chat |
| SDAR-8B-Chat | Chat | huggingface.co/JetLM/SDAR-8B-Chat |
| SDAR-30B-A3B-Chat | Chat | huggingface.co/JetLM/SDAR-30B-A3B-Chat |
| SDAR-30B-A3B-Sci | Thinking (Science) | huggingface.co/JetLM/SDAR-30B-A3B-Sci |

🚩 Roadmap

  • Release SDAR Technical Report
  • Release Inference Engine and Training Framework
  • More features are work in progress

🤝 Core Contributors

  • Shuang Cheng: Initial idea proposal, model evaluation, and inference.
  • Yihan Bian: Engineering optimization, inference & training acceleration, MoE training code implementation.
  • Dawei Liu: Implementation of model training code, training experiments.
  • Biqing Qi: Project Leader and overall coordination.

Note

This project is a collaborative effort, with all contributors solving challenges together.

For the full list of contributors, please refer to the author list in the citation. We are also deeply grateful to everyone who engaged in discussions and provided valuable feedback throughout the development of this project.

👏 Acknowledgements

We would like to express our gratitude to the following works (MDLM, LLaDA, DiffuLLaMA, Block Diffusion) for providing important theoretical foundations and inspiration for SDAR.

📬 Contact

For issues or inquiries:

🔬 Citation

@misc{JetAstra2025,
  title={SDAR: A Synergistic Diffusion–AutoRegression Paradigm for Scalable Sequence Generation},
  author={Shuang Cheng and Yihan Bian and Dawei Liu and Yuhua Jiang and Yihao Liu and Linfeng Zhang and Wenghai Wang and Qipeng Guo and Kai Chen and Biqing Qi* and Bowen Zhou},
  year={2025},
  institution={Shanghai AI Lab},
  url={https://github.com/JetAstra/SDAR}
}

⭐️ Star History

Star History Chart
