Benchmarking #1353

Merged: 1 commit (Nov 27, 2024)
25 changes: 24 additions & 1 deletion torchao/_models/llama/generate.py
@@ -217,7 +217,6 @@ def main(
float8_weight_only,
float8_dynamic_activation_float8_weight,
)
from torchao.prototype.quantization.autoquant_v2 import autoquant_v2
from torchao.utils import unwrap_tensor_subclass

from torchao.quantization.granularity import PerTensor, PerRow
@@ -297,6 +296,29 @@ def main(
dtype = _NBITS_TO_DTYPE[nbits]
group_size = int(_quant_args[2])
quantize_(model, uintx_weight_only(dtype, group_size, use_hqq=use_hqq))
elif "int8_dynamic_activation_intx_weight" in quantization:
from torchao.experimental.quant_api import int8_dynamic_activation_intx_weight
assert precision == torch.float32, "int8_dynamic_activation_intx_weight requires fp32 precision"

# Build kernels in temp location, and load them in torch
# This requires an ARM CPU
from torchao.experimental.temp_build import temp_build_and_load_torchao_ops
temp_build_and_load_torchao_ops(cmake_lists_path=os.path.dirname(os.path.realpath(__file__)) + "/../../experimental")
Contributor:

I don't follow why we are doing it this way. Why can we not just try to load the ops and, if they are not found, raise an exception with the build steps needed? Ideally we could detect the platform and build these deps as a prerequisite for running the benchmarks under the _models directory. When we are able to ship these kernels as part of the pip package, we may not need this.

Contributor:

> When we are able to ship these kernels as part of the pip package, we may not need this

What is the blocker for this?

Contributor:

We need to move out of experimental. I think we need to follow up on this. I believe we have sufficient evidence now? @supriyar?

Contributor Author:

> I don't follow why we are doing it this way. Why can we not just try to load the ops and, if they are not found, raise an exception with the build steps needed? Ideally we could detect the platform and build these deps as a prerequisite for running the benchmarks under the _models directory. When we are able to ship these kernels as part of the pip package, we may not need this.

I guess by "just try to load the ops" you mean something like this: https://github.com/pytorch/executorch/blob/main/extension/llm/custom_ops/sdpa_with_kv_cache.py#L21-L34 (but with the except block replaced by install instructions)

We could do that, but in the current setup, won't the try block always fail when running this benchmarking script (unless we build/load the ops in setup.py)? I'm also not sure what the instructions would say to make the script runnable without telling the user to modify the script by adding a torch.load_library line. I guess we could ask them to define an environment variable with the library location?
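For concreteness, a minimal sketch of the "load or instruct" pattern being discussed, assuming the library location comes from an environment variable (the variable name and error text below are illustrative only, not part of this PR):

```python
import os

import torch


def load_experimental_ops():
    # Hypothetical: the user points the script at a prebuilt ops library via an
    # environment variable instead of the script building the ops itself.
    lib_path = os.environ.get("TORCHAO_EXPERIMENTAL_OPS_LIB")
    if lib_path is None or not os.path.exists(lib_path):
        raise RuntimeError(
            "torchao experimental ops not found. Build them with the CMake "
            "project under torchao/experimental and set "
            "TORCHAO_EXPERIMENTAL_OPS_LIB to the resulting "
            "libtorchao_ops_aten library."
        )
    torch.ops.load_library(lib_path)
```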

Contributor:

> but with the except block replaced by install instructions

> I'm also not sure what the instructions would say to make the script runnable without telling the user to modify the script by adding a torch.load_library line.

@metascroy Yes, but I don't follow the second part. Why would the user need to modify the script? Can we not just import torchao.experimental.lowbit_ops, which internally does the try/except? But I do kind of get what you are doing here, because any build script will have to figure out where to put the build artifact (.so), and then we need to load it from there.

Ideally it should be installed as part of the setup instructions or the pip package. We could also follow something like https://github.com/pytorch/ao/blob/main/setup.py#L53 and add an extra option to build with the experimental low-bit quant features. Then, if the user invokes the benchmarking script without having built the experimental kernels, you can suggest `python setup.py --use_experimental_low_bit_kernels` or `pip install . --use_experimental_low_bit_kernels`. This feels a bit cleaner to me, but I'm curious to hear your thoughts, and also @msaroufim's.

Contributor Author:

The concern is whether we can reuse the CMake setup we already have (e.g., this function that sets up the parallel compute: https://github.com/pytorch/ao/blob/main/torchao/experimental/Utils.cmake#L7). If we bring in KleidiAI/CPUInfo via CMake, that will be more stuff to worry about.

I haven't used the torch CppExtension in setup.py, but it looks fairly simplistic compared to cmake. Perhaps we could do something like what this blog does, if it does not already exist in PyTorch: https://martinopilia.com/posts/2018/09/15/building-python-extension.html
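For reference, the pattern described in that blog post (and in pybind11's cmake_example) wraps a CMake build in a custom setuptools extension. A rough sketch follows; the package name and paths are illustrative, not torchao's actual setup.py:

```python
import os
import subprocess

from setuptools import Extension, setup
from setuptools.command.build_ext import build_ext


class CMakeExtension(Extension):
    def __init__(self, name, cmake_lists_dir):
        # No Python sources; CMake produces the artifact.
        super().__init__(name, sources=[])
        self.cmake_lists_dir = os.path.abspath(cmake_lists_dir)


class CMakeBuild(build_ext):
    def build_extension(self, ext):
        # Place the built library where setuptools expects the extension.
        out_dir = os.path.abspath(os.path.dirname(self.get_ext_fullpath(ext.name)))
        os.makedirs(self.build_temp, exist_ok=True)
        subprocess.check_call(
            ["cmake", ext.cmake_lists_dir, f"-DCMAKE_LIBRARY_OUTPUT_DIRECTORY={out_dir}"],
            cwd=self.build_temp,
        )
        subprocess.check_call(
            ["cmake", "--build", ".", "--config", "Release"], cwd=self.build_temp
        )


setup(
    name="example_cmake_package",  # illustrative
    ext_modules=[CMakeExtension("example_ops", cmake_lists_dir="cpp")],  # illustrative path
    cmdclass={"build_ext": CMakeBuild},
)
```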

Contributor:

Maybe I am missing something, but ExecuTorch's pybinding extension with XNNPACK builds XNNPACK, which builds with cpuinfo and pthreadpool and everything. Granted, that is a whole lot to build, but it does work in ET. So there is likely a nuance you are worried about that I am not understanding.

Contributor Author:

I guess this goes back to one of my earlier comments to @jerryzh168: the current setup.py in torchao does not really support cmake, and it would require a good amount of refactoring to support it. Currently, setup.py in torchao is built on utilities in torch.utils.cpp_extension, which look somewhat simplistic and, as far as I can tell, do not support cmake.

ET's setup.py defines a custom extension to support cmake; doing something similar in torchao looks like a sizable refactor of its setup.

Contributor:

Ok, I will have to rely on your answer for this since I haven't looked at all the details of cpp_extension. I do remember it being simple, but I don't know if there are ways to add deps as part of cpp_extension. I doubt it, but see if it's possible.

If not, I think it is worth proposing this for the sake of making ARM kernels available in Mac builds. @drisspg any thoughts? The discussion here is largely around a more complex setup.py that would allow us to build C++ package extensions that package CPU kernels and make them available as part of the Python package.

Contributor Author:

I edited setup.py to build the experimental kernels here: D67777662

Need feedback from torchao folks on whether the changes are acceptable.


# Quantize model
_quant_args = quantization.split("-")
nbit = int(_quant_args[1])
assert nbit >= 1 and nbit <= 8, "nbits must be 1 to 8"
group_size = int(_quant_args[2])
has_weight_zeros = bool(_quant_args[3])
quantize_(
model,
int8_dynamic_activation_intx_weight(
group_size=group_size,
nbit=nbit,
has_weight_zeros=has_weight_zeros,
),
)
elif "float8wo" in quantization:
quantize_(model, float8_weight_only())
elif "float8dq" in quantization:
@@ -309,6 +331,7 @@ def main(
granularity = PerTensor()
quantize_(model, float8_dynamic_activation_float8_weight(granularity=granularity))
elif "autoquant_v2" in quantization:
from torchao.prototype.quantization.autoquant_v2 import autoquant_v2
from torchao._models._eval import InputRecorder
from torchao._models.llama.model import prepare_inputs_for_model

43 changes: 43 additions & 0 deletions torchao/experimental/temp_build.py
@@ -0,0 +1,43 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.

import glob
import subprocess
import tempfile
import torch

def cmake_build_torchao_ops(cmake_lists_path, temp_build_dir):
from distutils.sysconfig import get_python_lib
print("Building torchao ops for ATen target")
cmake_prefix_path = get_python_lib()
subprocess.run(
[
"cmake",
"-DCMAKE_PREFIX_PATH=" + cmake_prefix_path,
"-DCMAKE_INSTALL_PREFIX=" + temp_build_dir.name,
"-S " + cmake_lists_path,
"-B " + temp_build_dir.name,
]
)
subprocess.run(
[
"cmake",
"--build",
temp_build_dir.name,
"-j 16",
"--target install",
"--config Release",
]
)

def temp_build_and_load_torchao_ops(cmake_lists_path):
temp_build_dir = tempfile.TemporaryDirectory()
cmake_build_torchao_ops(cmake_lists_path, temp_build_dir)
libs = glob.glob(f"{temp_build_dir.name}/lib/libtorchao_ops_aten.*")
libs = list(filter(lambda l: (l.endswith("so") or l.endswith("dylib")), libs))
assert len(libs) == 1
torch.ops.load_library(libs[0])
print(f"TorchAO ops are loaded from {libs[0]}")
9 changes: 9 additions & 0 deletions torchao/quantization/README.md
@@ -333,7 +333,16 @@ We're trying to develop kernels for low bit quantization for intx quantization f

You can try out these apis with the `quantize_` api as above, alongside the constructor `uintx_weight_only`; an example can be found in `torchao/_models/llama/generate.py`.

### int8_dynamic_activation_intx_weight Quantization
We have kernels that do 8-bit dynamic quantization of activations and uintx groupwise quantization of weights. These kernels are experimental and can only be run on a device with an ARM CPU (e.g., a Mac computer with Apple silicon). The benchmarks below were run on an M1 Mac Pro with 8 performance cores, 2 efficiency cores, and 32 GB of RAM. In all cases, torch.compile was used.

| Model | Technique | Tokens/Second | Memory Bandwidth (GB/s) | Peak Memory (GB) | Model Size (GB) |
| ------------- | -------------------------------------------------| --------------| ------------------------| ---------------- | ----------------|
| Llama-3.1-8B | Base (bfloat16) | 1.24 | 18.62 | NA | 15.01 |
| | int8_dynamic_activation_intx_weight-4-256-false | 16.03 | 65.81 | NA | 4.11 |
| | int8_dynamic_activation_intx_weight-3-256-false | 18.94 | 59.97 | NA | 3.17 |

You can try out these apis with the `quantize_` api as above, alongside the constructor `int8_dynamic_activation_intx_weight`. An example can be found in `torchao/_models/llama/generate.py`.

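A minimal usage sketch, mirroring the constructor arguments used in `generate.py` (the toy model and the 4-bit, group-size-256, no-weight-zeros settings are illustrative; the experimental ARM kernels must be built and loaded first):

```python
import torch

from torchao.experimental.quant_api import int8_dynamic_activation_intx_weight
from torchao.quantization import quantize_

# The experimental kernels require fp32 precision.
model = torch.nn.Sequential(torch.nn.Linear(256, 256)).to(torch.float32)

quantize_(
    model,
    int8_dynamic_activation_intx_weight(
        group_size=256,
        nbit=4,
        has_weight_zeros=False,
    ),
)
```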
### Automatic Inductor Configuration
The `quantize_` and `autoquant` apis now automatically use our recommended inductor configuration settings. You can mimic the same configuration for your own experiments by using `torchao.quantization.utils.recommended_inductor_config_setter` to replicate our recommended settings. Alternatively, if you wish to disable these recommended settings, you can pass the keyword argument `set_inductor_config=False` to the `quantize_` or `autoquant` apis to prevent assignment of those configuration settings. You can also overwrite these configuration settings after they are assigned, as long as you do so before passing any inputs to the torch.compiled model. This means that previous flows which referenced a variety of inductor configurations that needed to be set are now outdated, though continuing to manually set those same inductor configurations is unlikely to cause any issues.