Add qlinear_packing_format options #164
Conversation
Summary: The 4w_hqq scheme uses the following torchao API:
```
Int4WeightOnlyConfig(
group_size=32,
int4_packing_format="tile_packed_to_4d",
int4_choose_qparams_algorithm="hqq",
)
```
After a linear layer is quantized with 4w_hqq, it will use aten._weight_int4pack_mm to perform matrix multiplication.
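As a minimal sketch, this is roughly how the config above would be applied with torchao's quantize_ entry point. The model, shapes, and device here are placeholders, and the Int4WeightOnlyConfig kwargs assume a recent torchao version that supports them:

```
import torch
from torchao.quantization import quantize_, Int4WeightOnlyConfig

# Placeholder module; any model with nn.Linear layers is handled the same way.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096, dtype=torch.bfloat16, device="cuda")
)

quantize_(
    model,
    Int4WeightOnlyConfig(
        group_size=32,
        int4_packing_format="tile_packed_to_4d",
        int4_choose_qparams_algorithm="hqq",
    ),
)

# Matmuls in the quantized linear layers now dispatch to aten._weight_int4pack_mm.
x = torch.randn(1, 4096, dtype=torch.bfloat16, device="cuda")
y = model(x)
```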
Summary: When quantizing a model with 4w_hqq (huggingface/optimum-executorch#164), AOTI-generated code will call aoti_torch_cuda__weight_int4pack_mm as a fallback op. This PR borrows the CUDA implementation of _weight_int4pack_mm_cuda from libtorch, replacing at::Tensor and the relevant utility functions with their ET equivalents.

Using the Voxtral runner as an example, here are the generated ptd file size and latency with the bfloat16 format:

```
aoti_cuda_blob.ptd: 9.0 GB
Program load latency (ms): 0.054
Method load latency (ms):
  audio_encoder: 1492.989
  token_embedding: 803.561
  text_decoder: 6556.770
Run latency (ms):
  audio_encoder: 76.848
  token_embedding: 6.479
  text_decoder: 149.128
```

With `--qlinear 4w_hqq --qlinear_encoder 4w_hqq`, the ptd file size is cut by more than half, with slowdowns in the encoder and decoder:

```
aoti_cuda_blob.ptd: 3.7 GB
Program load latency (ms): 0.051
Method load latency (ms):
  audio_encoder: 716.667
  token_embedding: 633.476
  text_decoder: 1840.760
Run latency (ms):
  audio_encoder: 329.274
  token_embedding: 4.285
  text_decoder: 335.590
```

Differential Revision: [D84395275](https://our.internmc.facebook.com/intern/diff/D84395275)
Pull Request resolved: #15030
cc @metascroy
```
weight_dtype=torch.int4,
granularity=linear_weight_granularity,
),
"4w_hqq": Int4WeightOnlyConfig(
```
Which config should we use for weight-only 4-bit (for desktop/laptop on CUDA/Metal backends)?
How are the CUDA/Metal backends in ET designed? Are they just a thin wrapper around AOTI, or do they do some graph passes before lowering?
If you use IntxWeightOnlyConfig, you'll get quantization represented as a dequant op followed by a linear op, and this pattern can be matched/consumed by backends. In this way, a generic quantization config can serve multiple ET backends. This is how XNNPACK/Vulkan/CoreML work today with quantize_.
ET's CUDA/Metal backends could theoretically do this as well.
The config here is not generic, but specific to CUDA (see the "tile_packed_to_4d" format), and after export it will reference CUDA kernels. It is functional, but it is not how other ET backends work. At the very least, we should put cuda somewhere in the qmode name, e.g. "4w_hqq_cuda".
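For reference, a rough sketch of the generic IntxWeightOnlyConfig path described above, using the parameters visible in the diff context. PerGroup granularity and the exact import paths are assumptions and may vary across torchao versions:

```
import torch
from torchao.quantization import quantize_, IntxWeightOnlyConfig
from torchao.quantization.granularity import PerGroup

model = torch.nn.Sequential(torch.nn.Linear(4096, 4096))

quantize_(
    model,
    IntxWeightOnlyConfig(
        weight_dtype=torch.int4,
        granularity=PerGroup(32),
    ),
)

# After export, this shows up as a dequantize op followed by a linear op,
# a pattern that backends like XNNPACK/Vulkan/CoreML can match and consume.
```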
The current config looks good; this is targeting the tinygemm kernel in CUDA.
For Metal, it seems like it is calling into the same aten op: https://github.com/pytorch/pytorch/blob/70ec464c1608116df6d379e097f9149b22407456/aten/src/ATen/native/native_functions.yaml#L4218
but we haven't tested this path in torchao.
I still think if we're taking this route, the qmode needs to reference cuda/metal in some way. It cannot be a generic "4w_hqq".
We can make qmode="cuda:4w" and qmode="metal:4w" match the same config.
That works. If we do that, we should error out if the user passes in an invalid combination like tile_packed_to_4d + XNNPACK, etc.
Having an arg to configure a specific quantization scheme is a bit overkill in my opinion; I'd like to keep the CLI clean so it doesn't become like export_llama.
> qmode="cuda:4w" and qmode="metal:4w"
I'd prefer Scott's suggestion here; for other quant formats we can express it in the arg, e.g. cuda:4w-<format>.
Either that or we have a more general qlinear_packing_format arg and we do some validation that it's used appropriately.
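A rough sketch of that validation idea (SUPPORTED_PACKING_FORMATS and the backend argument are hypothetical names, not the actual optimum-executorch code):

```
# Map each packing format to the backends it makes sense for.
SUPPORTED_PACKING_FORMATS = {
    # possibly "metal" as well; untested per the discussion above
    "tile_packed_to_4d": {"cuda"},
}

def validate_qlinear_packing_format(packing_format, backend):
    """Error out on combinations like tile_packed_to_4d + xnnpack."""
    if packing_format is None:
        return
    allowed = SUPPORTED_PACKING_FORMATS.get(packing_format)
    if allowed is None:
        raise ValueError(f"Unknown qlinear_packing_format: {packing_format!r}")
    if backend not in allowed:
        raise ValueError(
            f"qlinear_packing_format={packing_format!r} is not supported for "
            f"backend {backend!r}; expected one of {sorted(allowed)}"
        )
```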
Yeah, I'd prefer cuda:4w or metal:4w for qlinear (maybe with -) for these device-specific formats.
> IntxWeightOnlyConfig also works for CUDA
This is true, but the packed format is more optimized for 4-bit. You could still access it with "4w" if you want.
I am not the best person to design what this qmode string should look like; e.g., there is also a qparams_algorithm config which could potentially be part of the qmode string, and maybe other configs that I am not aware of. So I decided to go with adding a qlinear_packing_format option. Someone can consolidate things into a qmode later if they want to.
@desertfire can you make this a general qlinear packing format, i.e. --qlinear_packing_format instead of --qlinear_int4_packing_format?
```
eager_model: torch.nn.Module,
qlinear_config: Optional[str] = None,
qlinear_group_size: Optional[int] = 32,
qlinear_int4_packing_format: Optional[str] = None,
```
Add qlinear_encoder_int4_packing_format maybe? Not critical if you are not using it though.
I am just following the existing convention. In optimum/exporters/executorch/tasks/multimodal_text_to_text.py, quantize_model_ is called separately for the decoder and encoder, so no need to add another qlinear_encoder_int4_packing_format.
```
quantize_model_(
    eager_model,
    qlinear_config=qlinear_config,
    qlinear_int4_packing_format=qlinear_int4_packing_format,
```
Would it make more sense to merge this into qlinear_config itself, and just create a config that's used in the torchao quantize_ function? E.g. Int4WeightOnlyConfig(..., int4_packing_format=qlinear_int4_packing_format).
oh nvm, seems like you are doing this in quantize_model_ itself
Summary: Using --qlinear_packing_format to apply the following torchao API,
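Presumably the same Int4WeightOnlyConfig as in the original summary, with int4_packing_format now supplied by the new flag; the wiring shown below is an assumption, not the actual code:

```
from torchao.quantization import Int4WeightOnlyConfig

# Value passed via --qlinear_packing_format, e.g. "tile_packed_to_4d" (assumed wiring).
qlinear_packing_format = "tile_packed_to_4d"

Int4WeightOnlyConfig(
    group_size=32,
    int4_packing_format=qlinear_packing_format,
    int4_choose_qparams_algorithm="hqq",
)
```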
This enables aten._weight_int4pack_mm for matrix multiplication.