Related to issues #128 and #302, we've been talking about supporting the NPU for the last few years. Now that more commercial NPU platforms are becoming available (e.g. with the recent arrival of the Intel Core Ultra NPU), it is time to formally define NPU support in the WebNN spec. There are two key elements to this specification:
- An ability to specify a device type for the NPU. Unlike more general-purpose devices such as the GPU and CPU, an NPU supports a limited, finite set of operations and offers no general programmability. To keep model execution stable and predictable, the notion of a fallback device is needed to support NPU acceleration during model inference (see the first sketch after this list).
- A minimum set of operators required to support quantized models. Because most NPUs use much simpler and less power-hungry low-bit integer arithmetic units, models targeting the NPU almost always need to be quantized first. The bare minimum here is just two operators -- `quantizeLinear` and `dequantizeLinear`. Pairing them up at the right places in the model graph -- the so-called tensor-oriented QDQ format used in ONNX -- is enough to handle quantized models. Additionally, two more prominent quantized operators, one for convolution and another for matmul, i.e. `conv2dInt` and `matmulInt`, would allow more quantized models not already expressed in the QDQ format to function (see the second sketch below).
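
For the first element, a minimal sketch of what requesting an NPU-backed context could look like, assuming an `"npu"` value is added to `MLDeviceType` in `MLContextOptions`; that enum value and the fallback behavior are exactly what this issue proposes, not something already in the spec:

```js
// Sketch only: assumes MLDeviceType gains an "npu" value. How execution
// falls back to another device (e.g. the CPU) for operators the NPU cannot
// run is the open question this issue raises.
const context = await navigator.ml.createContext({ deviceType: 'npu' });
const builder = new MLGraphBuilder(context);
```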
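
For the second element, a rough sketch of the tensor-oriented QDQ pattern built with the proposed `quantizeLinear` / `dequantizeLinear` operators; the operand shapes and per-tensor quantization parameters below are illustrative, since the exact signatures are still to be defined:

```js
// Quantized (uint8) weights stored in the model, plus their quantization
// parameters (per-tensor scale and zero point, for illustration).
const qWeights = builder.constant(
    { dataType: 'uint8', dimensions: [16, 3, 3, 3] },
    new Uint8Array(16 * 3 * 3 * 3));
const scale = builder.constant(
    { dataType: 'float32', dimensions: [1] }, new Float32Array([0.02]));
const zeroPoint = builder.constant(
    { dataType: 'uint8', dimensions: [1] }, new Uint8Array([128]));

// Tensor-oriented QDQ pairing: dequantize -> float op -> quantize.
// A backend targeting the NPU can recognize this pattern and fuse it into
// a single quantized convolution.
const input = builder.input('input',
    { dataType: 'float32', dimensions: [1, 3, 224, 224] });
const weights = builder.dequantizeLinear(qWeights, scale, zeroPoint);
const convOut = builder.conv2d(input, weights);
const output = builder.quantizeLinear(convOut, scale, zeroPoint);
```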