Add support for fbgemm int4 mm kernel #2255

jerryzh168 · 2025-05-23T20:07:25Z

Summary:
This PR adds support for the fbgemm int4 mm kernel. We also plan to expose other kernels, such as fp8xint4, bf16xfp8, and fp8xfp8, to compare with existing torchao kernels.

Test Plan:
test/dtypes/test_fbgemm_int4_tensor.py

H100, with compile unless noted otherwise:

config - batch size	overall tokens/sec	TTFT	Peak Memory	Model Size
baseline - 1	131.65	0.0220	16.24 GB	15.01 GB
baseline - 128	76.38	0.0544	26.92 GB	15.01 GB
int4wo - 1	207.69	0.0288	6.41 GB	3.99 GB
int4wo - 128	12.85	0.4223	16.01 GB	3.99 GB
fbgemm-int4 - 1 (no compile)	40.00	0.0508	29.03 GB	4.22 GB
fbgemm-int4 - 128 (no compile)	11.46	0.0846	28.96 GB	4.22 GB
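
To make the table easier to read, here is a small sketch that divides each quantized configuration's throughput by the baseline at the same batch size. The numbers are copied from the table above; the helper name `relative_throughput` is made up for this illustration. Keep in mind that the fbgemm-int4 rows were measured without compile while the baseline rows used compile, so those ratios are not a like-for-like comparison.

```python
# Overall tokens/sec copied from the benchmark table, keyed by (config, batch size).
tokens_per_sec = {
    ("baseline", 1): 131.65,
    ("baseline", 128): 76.38,
    ("int4wo", 1): 207.69,
    ("int4wo", 128): 12.85,
    ("fbgemm-int4", 1): 40.00,   # measured without compile
    ("fbgemm-int4", 128): 11.46, # measured without compile
}


def relative_throughput(results):
    """Divide each non-baseline entry by the baseline at the same batch size."""
    return {
        (name, batch_size): tps / results[("baseline", batch_size)]
        for (name, batch_size), tps in results.items()
        if name != "baseline"
    }


ratios = relative_throughput(tokens_per_sec)
for (name, batch_size), ratio in sorted(ratios.items()):
    print(f"{name:12s} batch={batch_size:<4d} {ratio:.2f}x baseline")
```

A ratio above 1 means the configuration is faster than the compiled baseline at that batch size, and a ratio below 1 means it is slower.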

export CHECKPOINT_PATH=../../../checkpoints # path to checkpoints folder
export MODEL_REPO=meta-llama/Meta-Llama-3.1-8B-Instruct
# default batch size 1
python generate.py --checkpoint_path $CHECKPOINT_PATH/$MODEL_REPO/model.pth --compile --write_result benchmark_results.txt
python generate.py --checkpoint_path $CHECKPOINT_PATH/$MODEL_REPO/model.pth --compile --quantization int4wo-128 --write_result benchmark_results.txt
python generate.py --checkpoint_path $CHECKPOINT_PATH/$MODEL_REPO/model.pth --quantization fbgemm-int4-128 --write_result benchmark_results.txt

python generate.py --checkpoint_path $CHECKPOINT_PATH/$MODEL_REPO/model.pth --compile --write_result benchmark_results.txt --batch_size 128
python generate.py --checkpoint_path $CHECKPOINT_PATH/$MODEL_REPO/model.pth --compile --quantization int4wo-128 --write_result benchmark_results.txt --batch_size 128
python generate.py --checkpoint_path $CHECKPOINT_PATH/$MODEL_REPO/model.pth --quantization fbgemm-int4-128 --write_result benchmark_results.txt --batch_size 128

Note: fbgemm-int4-128 does not work with compile yet because the fbgemm op does not have a meta device implementation.

pytorch-bot · 2025-05-23T20:07:28Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2255

✅ No Failures

As of commit d2066dc with merge base b0cfeec:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

samanamp · 2025-05-23T21:41:02Z

Thank you! The community really needs this.

torchao/dtypes/fbgemm_int4_tensor.py

torchao/quantization/quant_api.py

torchao/dtypes/fbgemm_int4_tensor.py

torchao/utils.py

drisspg

Okay, everything looks pretty good, but the API for the FBGEMM config feels gross imo. I know it's a thin wrapper around their op, but I think we can do better than an io string.

drisspg

Looks good. Can you also add a serialization test entry? I want to ensure we can serialize str enums.
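
As a rough illustration of the kind of round trip such a serialization test might cover, here is a minimal sketch using only the standard library. The names `KernelDtypes`, `ToyKernelConfig`, `config_to_json`, and `config_from_json` are hypothetical and are not torchao's config classes or its serialization helpers. Because the enum subclasses `str`, the `json` module writes each member as its plain string value.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class KernelDtypes(str, Enum):
    # Hypothetical members; the values are illustrative only.
    BF16_INT4 = "bf16_int4"
    FP8_INT4 = "fp8_int4"


@dataclass
class ToyKernelConfig:
    # Hypothetical config holding a str enum instead of a free-form string.
    dtypes: KernelDtypes
    group_size: int = 128


def config_to_json(cfg: ToyKernelConfig) -> str:
    # str-based enum members are encoded by json as ordinary strings.
    return json.dumps(asdict(cfg))


def config_from_json(text: str) -> ToyKernelConfig:
    data = json.loads(text)
    # Looking the value up in the enum rejects unknown strings with ValueError.
    return ToyKernelConfig(
        dtypes=KernelDtypes(data["dtypes"]),
        group_size=data["group_size"],
    )


original = ToyKernelConfig(dtypes=KernelDtypes.BF16_INT4)
text = config_to_json(original)
restored = config_from_json(text)
print(text)
print(restored == original)
```

Compared with a free-form string, the enum lookup on load turns a typo into an immediate error instead of a silently accepted value.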

torchao/quantization/quant_api.py

facebook-github-bot added the CLA Signed label May 23, 2025

jerryzh168 force-pushed the fbgemm-bf16-int4 branch from 9df9b49 to 3253e6a May 27, 2025 05:26

jerryzh168 added the topic: new feature label May 27, 2025

jerryzh168 requested a review from drisspg May 27, 2025 20:57

drisspg reviewed May 27, 2025

torchao/dtypes/fbgemm_int4_tensor.py (outdated, resolved)

drisspg reviewed May 27, 2025

torchao/quantization/quant_api.py (outdated, resolved)

drisspg reviewed May 27, 2025

torchao/quantization/quant_api.py (outdated, resolved)

drisspg reviewed May 27, 2025

torchao/dtypes/fbgemm_int4_tensor.py (outdated, resolved)

jerryzh168 requested a review from drisspg May 28, 2025 00:07