Conversation

Contributor
@desertfire commented Oct 10, 2025

Summary: Add a --qlinear_packing_format option to apply the following torchao API,
```
Int4WeightOnlyConfig(
    group_size=32,
    int4_packing_format="tile_packed_to_4d",
    int4_choose_qparams_algorithm="hqq",
)
```
After a linear layer is quantized this way, it will use aten._weight_int4pack_mm to perform matrix multiplication.
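
For context, here is a minimal sketch of applying this config with torchao's quantize_. The model, shapes, and device handling are illustrative only, and the import paths assume a recent torchao release.
```
# Minimal sketch (illustrative): quantize a linear layer with the int4
# tile_packed_to_4d + HQQ config and run it, which dispatches to
# aten._weight_int4pack_mm on CUDA. Assumes a recent torchao.
import torch
from torchao.quantization import quantize_, Int4WeightOnlyConfig

model = torch.nn.Sequential(torch.nn.Linear(4096, 4096)).to(torch.bfloat16).cuda()

quantize_(
    model,
    Int4WeightOnlyConfig(
        group_size=32,
        int4_packing_format="tile_packed_to_4d",
        int4_choose_qparams_algorithm="hqq",
    ),
)

x = torch.randn(2, 4096, dtype=torch.bfloat16, device="cuda")
y = model(x)  # matmul now goes through aten._weight_int4pack_mm
```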
desertfire added a commit to pytorch/executorch that referenced this pull request Oct 10, 2025
Summary: When quantizing a model with 4w_hqq (huggingface/optimum-executorch#164), AOTI-generated code will call aoti_torch_cuda__weight_int4pack_mm as a fallback op. This PR borrows the CUDA implementation of _weight_int4pack_mm_cuda from libtorch by replacing at::Tensor and the relevant utility functions with their ET equivalents.

Using the Voxtral runner as an example,

With the bfloat16 format, here are the generated ptd file size and latency numbers:
```
aoti_cuda_blob.ptd: 9.0 GB

Program load latency (ms): 0.054
Method load latency (ms):
  audio_encoder: 1492.989
  token_embedding: 803.561
  text_decoder: 6556.770
Run latency (ms):
  audio_encoder: 76.848
  token_embedding: 6.479
  text_decoder: 149.128
```

With `--qlinear 4w_hqq --qlinear_encoder 4w_hqq`, the ptd file size is cut by more than half, with slowdowns in the encoder and decoder:
```
aoti_cuda_blob.ptd: 3.7 GB

Program load latency (ms): 0.051
Method load latency (ms):
  audio_encoder: 716.667
  token_embedding: 633.476
  text_decoder: 1840.760
Run latency (ms):
  audio_encoder: 329.274
  token_embedding: 4.285
  text_decoder: 335.590
```

[ghstack-poisoned]
desertfire added a commit to pytorch/executorch that referenced this pull request Oct 10, 2025
ghstack-source-id: a543a05
Pull Request resolved: #15030
@mergennachin
Collaborator

cc @metascroy

```
        weight_dtype=torch.int4,
        granularity=linear_weight_granularity,
    ),
    "4w_hqq": Int4WeightOnlyConfig(
```
Collaborator

@metascroy @jerryzh168

Which config should we use for weight-only 4-bit (for desktop/laptop on CUDA/Metal backends)?

Contributor

How are the CUDA/Metal backends in ET designed? Are they just a thin wrapper around AOTI, or do they do some graph passes before lowering?

If you use IntxWeightOnlyConfig, you'll get quantization represented as a dequant op followed by a linear op, and this pattern can be matched/consumed by backends. In this way, a generic quantization config can serve multiple ET backends. This is how XNNPACK/Vulkan/CoreML work today with quantize_.

ET's CUDA/Metal backends could theoretically do this as well.
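
For reference, a minimal sketch of that generic path, assuming torchao exposes IntxWeightOnlyConfig and PerGroup at these import paths (mirroring the existing config in this file):
```
# Sketch of the generic intx weight-only path: after quantize_, an exported
# graph represents the quantization as a dequant op feeding a linear op,
# which backends such as XNNPACK/Vulkan/CoreML pattern-match during lowering.
import torch
from torchao.quantization import quantize_, IntxWeightOnlyConfig
from torchao.quantization.granularity import PerGroup

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024))
quantize_(
    model,
    IntxWeightOnlyConfig(
        weight_dtype=torch.int4,
        granularity=PerGroup(32),
    ),
)
```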

The code here is not generic; it is specific to CUDA (see the "tile_packed_to_4d" format), and after export it will reference CUDA kernels. It is functional, but it is not how other ET backends work. At the very least, we should put "cuda" somewhere in the qmode name, e.g., "4w_hqq_cuda".


The current config looks good; this targets the tinygemm kernel on CUDA.


For Metal, it seems to call into the same aten op: https://github.com/pytorch/pytorch/blob/70ec464c1608116df6d379e097f9149b22407456/aten/src/ATen/native/native_functions.yaml#L4218

but we haven't tested that path in torchao.

Contributor

I still think if we're taking this route, the qmode needs to reference cuda/metal in some way. It cannot be a generic "4w_hqq".

We can make qmode="cuda:4w" and qmode="metal:4w" match the same config.
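
A hypothetical sketch of what that mapping could look like (the names and the resolver are illustrative, not the actual implementation):
```
# Hypothetical qmode -> torchao config mapping: "cuda:4w" and "metal:4w"
# resolve to the same Int4WeightOnlyConfig. Illustrative only.
from torchao.quantization import Int4WeightOnlyConfig

_INT4_TILE_PACKED = Int4WeightOnlyConfig(
    group_size=32,
    int4_packing_format="tile_packed_to_4d",
    int4_choose_qparams_algorithm="hqq",
)

QMODE_TO_CONFIG = {
    "cuda:4w": _INT4_TILE_PACKED,
    "metal:4w": _INT4_TILE_PACKED,
}

def resolve_qmode(qmode: str):
    # Error out early on unsupported combinations (e.g. asking an XNNPACK
    # flow for the tile_packed_to_4d format).
    if qmode not in QMODE_TO_CONFIG:
        raise ValueError(f"Unsupported qmode: {qmode!r}")
    return QMODE_TO_CONFIG[qmode]
```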

Collaborator

That works. If we do that, we should error out if the user passes in an invalid combination like tile_packed_to_4d + xnnpack.

Collaborator
@jackzhxng commented Oct 13, 2025

Having an arg to configure a specific quantization scheme is a bit overkill in my opinion; I'd like to keep the CLI clean so it doesn't become like export_llama.

qmode="cuda:4w" and qmode="metal:4w"

I'd prefer Scott's suggestion here; for other quant formats we'd express them in the arg, e.g. cuda:4w-<format>.

Either that, or we add a more general qlinear_packing_format arg and do some validation that it's used appropriately.

Contributor

Yeah, I'd prefer cuda:4w or metal:4w for qlinear (maybe with -) for these device-specific formats.

IntxWeightOnlyConfig also works for CUDA

This is true, but using the packed format is better optimized for 4-bit. You could still access it with "4w" if you want.

Contributor Author

I am not the best person to design what this qmode string should look like; for example, there is also a qparams_algorithm config that could potentially be part of the qmode string, and there may be other configs I am not aware of. So I decided to go with adding a qlinear_packing_format option. Someone can consolidate things into a qmode later if they want to.

Collaborator

@desertfire can you make this a general qlinear packing format, i.e. --qlinear_packing_format instead of --qlinear_int4_packing_format?

desertfire added a commit to pytorch/executorch that referenced this pull request Oct 13, 2025
Differential Revision: [D84395275](https://our.internmc.facebook.com/intern/diff/D84395275)
@desertfire changed the title from "Add a 4w_hqq quantization schema" to "Add int4_packing_format options" on Oct 13, 2025
```
    eager_model: torch.nn.Module,
    qlinear_config: Optional[str] = None,
    qlinear_group_size: Optional[int] = 32,
    qlinear_int4_packing_format: Optional[str] = None,
```
Collaborator
@larryliu0820 commented Oct 14, 2025

Add qlinear_encoder_int4_packing_format maybe? Not critical if you are not using it though.

Contributor Author

I am just following the existing convention. In optimum/exporters/executorch/tasks/multimodal_text_to_text.py, quantize_model_ is called separately for the decoder and encoder, so no need to add another qlinear_encoder_int4_packing_format.

```
quantize_model_(
    eager_model,
    qlinear_config=qlinear_config,
    qlinear_int4_packing_format=qlinear_int4_packing_format,
```


Would it make more sense to merge this into qlinear_config itself and just create a config that's used in the torchao quantize_ function? E.g. Int4WeightOnlyConfig(..., int4_packing_format=qlinear_int4_packing_format).


Oh, never mind; it seems you are already doing this in quantize_model_ itself.
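
For reference, a hypothetical, simplified sketch of how quantize_model_ can turn these arguments into a torchao config (names follow the snippet above; this is not the exact optimum-executorch code):
```
# Hypothetical, simplified version of quantize_model_ building the config
# from the CLI arguments; not the exact optimum-executorch implementation.
from typing import Optional
import torch
from torchao.quantization import quantize_, Int4WeightOnlyConfig

def quantize_model_(
    eager_model: torch.nn.Module,
    qlinear_config: Optional[str] = None,
    qlinear_group_size: Optional[int] = 32,
    qlinear_packing_format: Optional[str] = None,
):
    if qlinear_packing_format == "tile_packed_to_4d":
        quantize_(
            eager_model,
            Int4WeightOnlyConfig(
                group_size=qlinear_group_size,
                int4_packing_format=qlinear_packing_format,
                int4_choose_qparams_algorithm="hqq",
            ),
        )
    # ... other qlinear_config values map to other torchao configs ...
```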

@desertfire changed the title from "Add int4_packing_format options" to "Add qlinear_packing_format options" on Oct 14, 2025
@larryliu0820 merged commit 09fdbd0 into huggingface:main on Oct 14, 2025
59 of 83 checks passed