WebNN should support NPU and QDQ operations #623

@wchao1115

Description

Related to issues #128 and #302, we've been talking about supporting the NPU for the last few years. Now that more commercial NPU platforms have become available (e.g. with the recent arrival of the Intel Core Ultra NPU), it is time to formally define NPU support in the WebNN spec. There are two key elements to this specification:

  1. The ability to specify a device type for the NPU. Unlike more general-purpose devices such as the GPU and CPU, an NPU supports a limited, finite set of operations and is not programmable. To keep model execution stable and more predictable, the notion of a fallback device is needed to support NPU acceleration during model inference.
  2. A minimum set of operators required to support quantized models. Because most NPUs use much simpler and less power-hungry low-bit integer arithmetic units, models targeting the NPU almost always need to be quantized first. The bare minimum here is just two operators: quantizeLinear and dequantizeLinear. These two are enough to handle quantized models by pairing them up at the right places in the model graph, the so-called tensor-oriented QDQ format used in ONNX. Additionally, two more prominent quantized operators, one for convolution (conv2dInt) and one for matmul (matmulInt), would allow more quantized models not already expressed in the QDQ format to function. A sketch of how the QDQ pairing could look follows this list.
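
To make the QDQ idea concrete, here is a minimal sketch of what building such a graph could look like. This is hypothetical: the `'npu'` device type and the `quantizeLinear`/`dequantizeLinear` builder methods are the operators being proposed in this issue and are not in the current spec, and the shapes, scale, and zero-point values are purely illustrative.

```js
// Request an NPU context; a real design would also let callers name a fallback
// device (e.g. CPU) for operators the NPU cannot execute.
// NOTE: deviceType 'npu' is the proposal in this issue, not existing API.
const context = await navigator.ml.createContext({ deviceType: 'npu' });
const builder = new MLGraphBuilder(context);

// Float activation input and pre-quantized int8 weights (placeholder values).
const input = builder.input('input',
  { dataType: 'float32', dimensions: [1, 3, 224, 224] });
const int8WeightData = new Int8Array(64 * 3 * 7 * 7); // placeholder weights
const weights = builder.constant(
  { dataType: 'int8', dimensions: [64, 3, 7, 7] }, int8WeightData);

// Per-tensor scale and zero point (illustrative values).
const scale = builder.constant(
  { dataType: 'float32', dimensions: [1] }, new Float32Array([0.05]));
const zeroPoint = builder.constant(
  { dataType: 'int8', dimensions: [1] }, new Int8Array([0]));

// Tensor-oriented QDQ pattern: quantize the activation, then dequantize both
// operands around an ordinary float conv2d. A backend can pattern-match the
// dequantizeLinear -> conv2d pair and execute it with integer arithmetic.
// quantizeLinear/dequantizeLinear here are the proposed operators.
const qInput = builder.quantizeLinear(input, scale, zeroPoint);
const dqInput = builder.dequantizeLinear(qInput, scale, zeroPoint);
const dqWeights = builder.dequantizeLinear(weights, scale, zeroPoint);
const output = builder.conv2d(dqInput, dqWeights);

const graph = await builder.build({ output });
```

The point of the pairing is that the graph stays expressed in float semantics, while an NPU-capable backend is free to fuse each dequantize/compute region into its native low-bit integer kernels; a backend without that capability can simply execute the float operators on the fallback device.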
