import math
from typing import Optional

import mindspore as ms
from mindspore import mint, ops


def dispatch_attention_fn(
    query: ms.Tensor,
    key: ms.Tensor,
    value: ms.Tensor,
    attn_mask: Optional[ms.Tensor] = None,
    dropout_p: float = 0.0,
    is_causal: bool = False,
    scale: Optional[float] = None,
):
    query, key, value = (x.permute(0, 2, 1, 3) for x in (query, key, value))
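    # Added note (assuming the usual diffusers (B, S, N, D) convention for q/k/v):
    # permute(0, 2, 1, 3) moves heads next to batch, i.e. (B, S, N, D) -> (B, N, S, D),
    # which is the BNSD layout expected by the attention kernels below.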
    # Note: PyTorch's SDPA and MindSpore's FlashAttention (FA) handle `attn_mask` differently.
    # In PyTorch, a non-boolean mask (e.g., float32 with 0/1 values) is interpreted as an
    # additive bias: `attn_bias = attn_mask + attn_bias`. This implicit branch can cause
    # problems when a pipeline mistakenly provides a 0/1 float mask instead of a boolean mask.
    # Although this behavior currently matches HF Diffusers, it remains a potential source of
    # bugs worth validating, which is why such masks are routed to the eager path below.
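    # Illustrative sketch (not part of the original code) of the mask forms involved:
    #   boolean mask:   [[True, True, False]]  -> the last position is masked out
    #   0/1 float mask: [[1.0,  1.0,  0.0 ]]   -> PyTorch adds it to the logits, so nothing is
    #                                             actually masked; kept positions are shifted by +1
    #   additive mask:  [[0.0,  0.0, -inf ]]   -> the intended float form
    # The `1.0 in attn_mask` check below routes suspicious 0/1 float masks through the eager
    # path so they keep PyTorch's additive semantics rather than being converted into a
    # discard mask by `flash_attention_op`.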
    if attn_mask is not None and attn_mask.dtype != ms.bool_ and 1.0 in attn_mask:
        L, S = query.shape[-2], key.shape[-2]
        scale_factor = 1 / math.sqrt(query.shape[-1]) if scale is None else scale
        attn_bias = mint.zeros((L, S), dtype=query.dtype)
        if is_causal:
            if attn_mask is not None:
                if attn_mask.dtype == ms.bool_:
                    attn_mask = mint.logical_and(attn_mask, mint.ones((L, S), dtype=ms.bool_).tril(diagonal=0))
                else:
                    attn_mask = attn_mask + mint.triu(
                        mint.full((L, S), float("-inf"), dtype=attn_mask.dtype), diagonal=1
                    )
            else:
                temp_mask = mint.ones((L, S), dtype=ms.bool_).tril(diagonal=0)
                attn_bias.masked_fill_(temp_mask.logical_not(), float("-inf"))
                attn_bias = attn_bias.to(query.dtype)

        if attn_mask is not None:
            if attn_mask.dtype == ms.bool_:
                attn_bias = attn_bias.masked_fill(attn_mask.logical_not(), float("-inf"))
            else:
                attn_bias = attn_mask + attn_bias

        attn_weight = mint.matmul(query, key.swapaxes(-2, -1)) * scale_factor
        attn_weight += attn_bias
        attn_weight = mint.softmax(attn_weight, dim=-1)
        attn_weight = ops.dropout(attn_weight, dropout_p, training=True)
        return mint.matmul(attn_weight, value).permute(0, 2, 1, 3)
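    # Added note: the FlashAttentionScore kernel behind `flash_attention_op` only accepts
    # float16/bfloat16 inputs, so other dtypes are computed in float16 and cast back to the
    # original dtype, trading some precision for the fused kernel.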
    if query.dtype in (ms.float16, ms.bfloat16):
        out = flash_attention_op(query, key, value, attn_mask, keep_prob=1 - dropout_p, scale=scale)
    else:
        out = flash_attention_op(
            query.to(ms.float16),
            key.to(ms.float16),
            value.to(ms.float16),
            attn_mask,
            keep_prob=1 - dropout_p,
            scale=scale,
        ).to(query.dtype)
    return out.permute(0, 2, 1, 3)


def flash_attention_op(
    query: ms.Tensor,
    key: ms.Tensor,
    value: ms.Tensor,
    attn_mask: Optional[ms.Tensor] = None,
    keep_prob: float = 1.0,
    scale: Optional[float] = None,
):
    # In most scenarios, q/k/v have already been transposed into BNSD layout before
    # scaled dot-product attention.
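    # BNSD = (batch, num_heads, seq_len, head_dim); BSH = (batch, seq_len, num_heads * head_dim).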
    input_layout = "BNSD"
    head_num = query.shape[1]
    if scale is None:
        scale = query.shape[-1] ** (-0.5)

    # Handle the case where q/k/v are 3-dimensional (BSH) after `head_to_batch_dim`
    if query.ndim == 3:
        input_layout = "BSH"
        head_num = 1

    # Process `attn_mask`, since the mask semantics differ between PyTorch and MindSpore:
    # in MindSpore, True marks positions to discard and False marks positions to retain;
    # in PyTorch it is the opposite.
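    # For example (illustration, not from the original code): a PyTorch-style boolean mask
    # [[True, True, False]] (True = keep) becomes [[False, False, True]] (True = discard),
    # and a float additive mask [[0., 0., -inf]] becomes the same discard mask via `.bool()`,
    # since only its -inf entries are non-zero. The mask is then broadcast to the full
    # (B, N, S_q, S_k) shape and sliced back to a single head, as FlashAttentionScore accepts
    # a (B, 1, S_q, S_k) mask shared across all heads.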
    if attn_mask is not None:
        attn_mask = mint.logical_not(attn_mask) if attn_mask.dtype == ms.bool_ else attn_mask.bool()
        attn_mask = mint.broadcast_to(
            attn_mask, (attn_mask.shape[0], attn_mask.shape[1], query.shape[-2], key.shape[-2])
        )[:, :1, :, :]

    return ops.operations.nn_ops.FlashAttentionScore(
        head_num=head_num, keep_prob=keep_prob, scale_value=scale, input_layout=input_layout
    )(query, key, value, None, None, None, attn_mask)[3]
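

if __name__ == "__main__":
    # Illustrative usage sketch (not part of the original change); assumes an Ascend/NPU
    # context where FlashAttentionScore is available. Inputs and outputs of
    # `dispatch_attention_fn` are laid out as (batch, seq_len, num_heads, head_dim).
    q = ops.randn(2, 1024, 8, 64, dtype=ms.float16)  # (B, S, N, D)
    k = ops.randn(2, 1024, 8, 64, dtype=ms.float16)
    v = ops.randn(2, 1024, 8, 64, dtype=ms.float16)
    out = dispatch_attention_fn(q, k, v)
    print(out.shape)  # expected: (2, 1024, 8, 64)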