Conversation

Contributor
@desertfire commented Oct 10, 2025

Summary: Add a --qlinear_packing_format option to apply the following torchao API,
```
Int4WeightOnlyConfig(
    group_size=32,
    int4_packing_format="tile_packed_to_4d",
    int4_choose_qparams_algorithm="hqq",
)
```
After a linear layer is quantized this way, it will use aten._weight_int4pack_mm to perform matrix multiplication.
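
For context, here is a minimal sketch of applying this config with torchao's quantize_. The model, shapes, and device handling are illustrative only, and the import paths assume a recent torchao release.
```
# Minimal sketch (illustrative): quantize a linear layer with the int4
# tile_packed_to_4d + HQQ config and run it, which dispatches to
# aten._weight_int4pack_mm on CUDA. Assumes a recent torchao.
import torch
from torchao.quantization import quantize_, Int4WeightOnlyConfig

model = torch.nn.Sequential(torch.nn.Linear(4096, 4096)).to(torch.bfloat16).cuda()

quantize_(
    model,
    Int4WeightOnlyConfig(
        group_size=32,
        int4_packing_format="tile_packed_to_4d",
        int4_choose_qparams_algorithm="hqq",
    ),
)

x = torch.randn(2, 4096, dtype=torch.bfloat16, device="cuda")
y = model(x)  # matmul now goes through aten._weight_int4pack_mm
```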
desertfire added a commit to pytorch/executorch that referenced this pull request Oct 10, 2025
Summary: When quantizing a model with 4w_hqq (huggingface/optimum-executorch#164), AOTI-generated code will call aoti_torch_cuda__weight_int4pack_mm as a fallback op. This PR borrows the CUDA implementation of _weight_int4pack_mm_cuda from libtorch by replacing at::Tensor and the relevant utility functions with their ET equivalents.

Using the Voxtral runner as an example,

With the bfloat16 format, here are the generated ptd file size and latency numbers:
```
aoti_cuda_blob.ptd: 9.0 GB

Program load latency (ms): 0.054
Method load latency (ms):
  audio_encoder: 1492.989
  token_embedding: 803.561
  text_decoder: 6556.770
Run latency (ms):
  audio_encoder: 76.848
  token_embedding: 6.479
  text_decoder: 149.128
```

With `--qlinear 4w_hqq --qlinear_encoder 4w_hqq`, the ptd file size is cut by more than half, with slowdowns in the encoder and decoder:
```
aoti_cuda_blob.ptd: 3.7 GB

Program load latency (ms): 0.051
Method load latency (ms):
  audio_encoder: 716.667
  token_embedding: 633.476
  text_decoder: 1840.760
Run latency (ms):
  audio_encoder: 329.274
  token_embedding: 4.285
  text_decoder: 335.590
```

[ghstack-poisoned]
desertfire added a commit to pytorch/executorch that referenced this pull request Oct 10, 2025
ghstack-source-id: a543a05
Pull Request resolved: #15030
@mergennachin
Collaborator

cc @metascroy

```
        weight_dtype=torch.int4,
        granularity=linear_weight_granularity,
    ),
    "4w_hqq": Int4WeightOnlyConfig(
```
Collaborator

@metascroy @jerryzh168

Which config should we use for weight-only 4-bit (for desktop/laptop on CUDA/Metal backends)?

Contributor

How are the CUDA/Metal backends in ET designed? Are they just a thin wrapper around AOTI, or do they do some graph passes before lowering?

If you use IntxWeightOnlyConfig, you'll get quantization represented as a dequant op followed by a linear op, and this pattern can be matched/consumed by backends. In this way, a generic quantization config can serve multiple ET backends. This is how XNNPACK/Vulkan/CoreML work today with quantize_.

ET's CUDA/Metal backends could theoretically do this as well.
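
For reference, a minimal sketch of that generic path, assuming torchao exposes IntxWeightOnlyConfig and PerGroup at these import paths (mirroring the existing config in this file):
```
# Sketch of the generic intx weight-only path: after quantize_, an exported
# graph represents the quantization as a dequant op feeding a linear op,
# which backends such as XNNPACK/Vulkan/CoreML pattern-match during lowering.
import torch
from torchao.quantization import quantize_, IntxWeightOnlyConfig
from torchao.quantization.granularity import PerGroup

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024))
quantize_(
    model,
    IntxWeightOnlyConfig(
        weight_dtype=torch.int4,
        granularity=PerGroup(32),
    ),
)
```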

The code here is not generic; it is specific to CUDA (see the "tile_packed_to_4d" format), and after export it will reference CUDA kernels. It is functional, but it is not how other ET backends work. At the very least, we should put "cuda" somewhere in the qmode name, e.g., "4w_hqq_cuda".


The current config looks good; this targets the tinygemm kernel on CUDA.


For Metal, it seems to call into the same aten op: https://github.com/pytorch/pytorch/blob/70ec464c1608116df6d379e097f9149b22407456/aten/src/ATen/native/native_functions.yaml#L4218

but we haven't tested that path in torchao.

Contributor

I still think if we're taking this route, the qmode needs to reference cuda/metal in some way. It cannot be a generic "4w_hqq".

We can make qmode="cuda:4w" and qmode="metal:4w" match the same config.
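
A hypothetical sketch of what that mapping could look like (the names and the resolver are illustrative, not the actual implementation):
```
# Hypothetical qmode -> torchao config mapping: "cuda:4w" and "metal:4w"
# resolve to the same Int4WeightOnlyConfig. Illustrative only.
from torchao.quantization import Int4WeightOnlyConfig

_INT4_TILE_PACKED = Int4WeightOnlyConfig(
    group_size=32,
    int4_packing_format="tile_packed_to_4d",
    int4_choose_qparams_algorithm="hqq",
)

QMODE_TO_CONFIG = {
    "cuda:4w": _INT4_TILE_PACKED,
    "metal:4w": _INT4_TILE_PACKED,
}

def resolve_qmode(qmode: str):
    # Error out early on unsupported combinations (e.g. asking an XNNPACK
    # flow for the tile_packed_to_4d format).
    if qmode not in QMODE_TO_CONFIG:
        raise ValueError(f"Unsupported qmode: {qmode!r}")
    return QMODE_TO_CONFIG[qmode]
```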

Collaborator

That works. If we do that, we should error out if the user passes in an invalid combination like tile_packed_to_4d + xnnpack.

Collaborator
@jackzhxng commented Oct 13, 2025

Having an arg to configure a specific quantization scheme is a bit overkill in my opinion; I'd like to keep the CLI clean so it doesn't become like export_llama.

qmode="cuda:4w" and qmode="metal:4w"

I'd prefer Scott's suggestion here; for other quant formats we'd express them in the arg, e.g. cuda:4w-<format>.

Either that, or we add a more general qlinear_packing_format arg and do some validation that it's used appropriately.

Contributor

Yeah, I'd prefer cuda:4w or metal:4w for qlinear (maybe with -) for these device-specific formats.

IntxWeightOnlyConfig also works for CUDA

This is true, but using the packed format is better optimized for 4-bit. You could still access it with "4w" if you want.

Contributor Author

I am not the best person to design what this qmode string should look like; for example, there is also a qparams_algorithm config that could potentially be part of the qmode string, and there may be other configs I am not aware of. So I decided to go with adding a qlinear_packing_format option. Someone can consolidate things into a qmode later if they want to.

Collaborator

@desertfire can you make this a general qlinear packing format, i.e. --qlinear_packing_format instead of --qlinear_int4_packing_format?

desertfire added a commit to pytorch/executorch that referenced this pull request Oct 13, 2025
Differential Revision: [D84395275](https://our.internmc.facebook.com/intern/diff/D84395275)
@desertfire changed the title from "Add a 4w_hqq quantization schema" to "Add int4_packing_format options" on Oct 13, 2025
```
    eager_model: torch.nn.Module,
    qlinear_config: Optional[str] = None,
    qlinear_group_size: Optional[int] = 32,
    qlinear_int4_packing_format: Optional[str] = None,
```
Collaborator
@larryliu0820 commented Oct 14, 2025

Add qlinear_encoder_int4_packing_format maybe? Not critical if you are not using it though.

Contributor Author

I am just following the existing convention. In optimum/exporters/executorch/tasks/multimodal_text_to_text.py, quantize_model_ is called separately for the decoder and encoder, so no need to add another qlinear_encoder_int4_packing_format.

```
quantize_model_(
    eager_model,
    qlinear_config=qlinear_config,
    qlinear_int4_packing_format=qlinear_int4_packing_format,
```


Would it make more sense to merge this into qlinear_config itself and just create a config that's used in the torchao quantize_ function? E.g. Int4WeightOnlyConfig(..., int4_packing_format=qlinear_int4_packing_format).


Oh, never mind; it seems you are already doing this in quantize_model_ itself.
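
For reference, a hypothetical, simplified sketch of how quantize_model_ can turn these arguments into a torchao config (names follow the snippet above; this is not the exact optimum-executorch code):
```
# Hypothetical, simplified version of quantize_model_ building the config
# from the CLI arguments; not the exact optimum-executorch implementation.
from typing import Optional
import torch
from torchao.quantization import quantize_, Int4WeightOnlyConfig

def quantize_model_(
    eager_model: torch.nn.Module,
    qlinear_config: Optional[str] = None,
    qlinear_group_size: Optional[int] = 32,
    qlinear_packing_format: Optional[str] = None,
):
    if qlinear_packing_format == "tile_packed_to_4d":
        quantize_(
            eager_model,
            Int4WeightOnlyConfig(
                group_size=qlinear_group_size,
                int4_packing_format=qlinear_packing_format,
                int4_choose_qparams_algorithm="hqq",
            ),
        )
    # ... other qlinear_config values map to other torchao configs ...
```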

@desertfire changed the title from "Add int4_packing_format options" to "Add qlinear_packing_format options" on Oct 14, 2025
@larryliu0820 merged commit 09fdbd0 into huggingface:main on Oct 14, 2025
59 of 83 checks passed