
Conversation

@anssiko
Member

@anssiko anssiko commented Jun 2, 2025

Adds a new subsection "Device Preference Use Cases" to the device-selection-explainer.md document.

This subsection details several use cases for device selection preferences, mapping them to the preferences discussed in the W3C WebML WG minutes of 2025-05-08 (https://www.w3.org/2025/05/08-webmachinelearning-minutes.html#2ec0).

The use cases cover:

  • Preferring CPU
  • Preferring NPU
  • Preferring GPU
  • Maximizing performance
  • Maximizing power efficiency
  • Minimizing overall system power

Future-proof device names ("where JS and Wasm execute", "where WebGL and WebGPU programs execute", "other") are used in the descriptions.
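
As a rough illustration of how the last two use cases map onto today's API surface, here is a minimal sketch using the existing `MLContextOptions.powerPreference` hint. The per-device preferences (prefer CPU/NPU/GPU) discussed in this PR have no standardized API surface yet; only the power hints below are specified.

```js
// Minimal sketch using the existing MLContextOptions.powerPreference hint.
// The per-device preferences (CPU/NPU/GPU) have no standardized API yet.
async function demoContexts() {
  // "Maximize performance": let the implementation pick the fastest device.
  const perfContext = await navigator.ml.createContext({
    powerPreference: 'high-performance',
  });

  // "Maximize power efficiency": favor sustained low power over peak speed.
  const lowPowerContext = await navigator.ml.createContext({
    powerPreference: 'low-power',
  });

  return { perfContext, lowPowerContext };
}
```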

@anssiko anssiko requested review from Copilot and zolkis June 2, 2025 12:54

Copilot AI left a comment


Pull Request Overview

This PR adds a new subsection titled "Device Preference Use Cases" to the device-selection-explainer.md document to clearly outline different device selection strategies for WebNN applications.

  • Adds detailed use cases for preferring CPU, NPU, GPU, maximizing performance, maximizing power efficiency, and minimizing overall system power.
  • Provides example preference strings and descriptions mapped to W3C WebML WG minutes.

Updates the description for the "Prefer execution on a Neural
Processing Unit (NPU)" use case in device-selection-explainer.md.

The term "other" as a future-proof device category is now
explicitly defined as encompassing specialized ML accelerators
that are distinct from CPUs and GPUs.

This addresses review feedback to improve clarity regarding the NPU use case and its relation to the "other" device category.
@anssiko
Member Author

anssiko commented Jun 2, 2025

N.B.: This is an experiment on how to translate the group's meeting discussion into an explainer update using coding agents working together. Any explainer content is non-normative and thus makes for a good testing ground. Also, in my experience, none of the agents understand all the Bikeshed syntax nuances yet but are good with Markdown.

That said, we require the usual human reviews for any changes, including for non-normative content.

Cc @dontcallmedom to share any W3C-level best practices on the topic.

@zolkis
Collaborator

zolkis commented Jun 2, 2025

Playing along with AI-generated markup, we could include/merge these considerations as well :).

Hardware Selection Guide

| Factor | CPU | GPU | NPU |
| --- | --- | --- | --- |
| Best For | Sequential logic, small models | Parallel training, large batches | Edge inference, low-power AI |
| Power Efficiency | Moderate | High consumption | Ultra-efficient |
| Latency | 50–100 ms | 10–30 ms | 2–10 ms |
| Typical Use | Preprocessing, decision trees | LLM training, computer vision | Smartphones, IoT devices |

Key Decision Criteria

  • Throughput needs:
    • GPUs handle >10k queries/sec, while NPUs typically manage 1–5k queries/sec.
  • Model complexity:
    • NPUs are optimized for transformer layers; GPUs excel at CNN/RNN workloads.
  • Deployment environment:
    • NPUs dominate mobile and edge devices; GPUs are standard in cloud and data center environments, but usable in client environments as well.

Modern systems often combine all three:

  • CPUs for input handling,
  • GPUs for model execution,
  • NPUs for post-processing, balancing performance and efficiency across the pipeline (a toy sketch follows below).
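
A toy JavaScript heuristic distilling the guide above; the thresholds are lifted from the (illustrative) table and decision criteria, not from any shipping API.

```js
// Toy device-selection heuristic based on the table and criteria above.
// Thresholds are illustrative; real systems would profile, not hardcode.
function pickDevice({ queriesPerSec = 0, batchSize = 1, powerConstrained = false }) {
  if (powerConstrained) return 'npu';       // ultra-efficient, 2–10 ms latency
  if (queriesPerSec > 10_000) return 'gpu'; // GPUs handle >10k queries/sec
  if (batchSize <= 8) return 'npu';         // consistent low latency at small batches
  if (batchSize > 32) return 'gpu';         // throughput scales with parallel compute
  return 'cpu';                             // sequential logic, small models
}

console.log(pickDevice({ powerConstrained: true }));               // 'npu'
console.log(pickDevice({ queriesPerSec: 20_000, batchSize: 64 })); // 'gpu'
```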

@zolkis
Collaborator

zolkis commented Jun 2, 2025

N.B. I am working on this as well right now, albeit using a human mind (for synthesis), and also adding to the historical sections.
Maybe we could merge this and I'll take a human second pass on these sections.

@zolkis
Collaborator

zolkis commented Jun 2, 2025

The md excerpt above was generated using Perplexity's agentic reasoning, taking into account about 20 sources, e.g.:
https://www.ibm.com/think/topics/npu-vs-gpu
https://collabnix.com/is-npu-better-than-gpu/
https://www.corsair.com/us/en/explorer/diy-builder/power-supply-units/cpu-vs-gpu-vs-npu-whats-the-difference/
https://gcore.com/learning/what-is-ai-inference
https://blog.squeezebits.com/36821
https://www.anyscale.com/blog/announcing-compiled-graphs
https://docs.vllm.ai/en/v0.8.2/models/extensions/runai_model_streamer.html
https://www.linkedin.com/posts/joseph-ede_how-to-load-information-in-artificial-intelligence-activity-7130211022090428417-YDZk/
and others, but the context still looks quite limited to me.

When using AI tools, it would be nice to share how to reproduce the output and which sources/context were taken into account.
For instance, as a next (fun) step, I used Perplexity with this prompt:
https://www.perplexity.ai/search/explain-ai-model-loading-graph-5F42DrfqSJyVCI76Y8RjNA?0=d
I'm putting the output in the next comment.

@zolkis
Collaborator

zolkis commented Jun 2, 2025

> During local inference (graph compilation) of various AI models, summarize strategies on how to distribute compute nodes (operations) among GPU, NPU, or CPU.

When distributing compute nodes across GPU, NPU, and CPU during AI model inference, optimal strategies depend on operation type, model architecture, and system constraints. Below are key approaches based on performance characteristics and hardware capabilities:

1. Operation-Type Optimization

  • Matrix multiplication (compute-bound):
    Use GPUs for large matrix operations (e.g., transformer prefill stages). GPUs achieve 22.6% lower latency and 2× higher throughput than NPUs for these tasks due to parallel compute units [1].
    • Example: Serving Llama 70B with TensorRT-LLM on NVIDIA Hopper GPUs [2].
  • Matrix-vector multiplication (memory-bound):
    Deploy NPUs, which reduce latency by 58.5% compared to GPUs. NPUs leverage DMA for efficient memory access, ideal for LLM decode phases [1][3].
    • Example: NPUs process TinyLlama inference 3.2× faster than GPUs [1].
  • Low-complexity operations (e.g., dot product):
    Assign to CPUs, which avoid GPU/NPU memory overhead and achieve lower latency for non-parallel tasks [1].

2. Model Architecture Considerations

  • Large Language Models (LLMs):
    • Prefill: GPU clusters (compute-heavy)
    • Decode: NPUs (memory-bound, sequential token generation) [1][3]
    • Use disaggregated serving (e.g., NVIDIA Dynamo) to split phases across devices, boosting throughput by 30× on GB200 NVL72 [2].
  • LSTM/RNN Models:
    Prefer GPUs, which outperform NPUs by 2.7× due to irregular memory access patterns [1].
  • Vision Models (e.g., MobileNetV2):
    • Small batches: NPUs (consistent latency)
    • Large batches: GPUs (scaling throughput) [1]

3. Batch Size and Latency Tradeoffs

| Scenario | Preferred Hardware | Rationale |
| --- | --- | --- |
| Small batch (1–8) | NPU | 3× lower latency for video classification [1] |
| Large batch (>32) | GPU | Throughput scales with parallel compute [1] |
| Real-time SLOs | NPU + CPU | NPU for decode, CPU for lightweight ops [3] |

4. Power-Constrained Deployments

  • NPUs consume ≤50% of the power of GPUs for equivalent performance, making them ideal for edge devices [1].
  • Use CPU/NPU hybrids for latency-sensitive applications requiring energy efficiency.

5. Dynamic Orchestration

Tools like NVIDIA Dynamo Planner monitor GPU/NPU utilization and automatically:

  • Shift decode GPUs to prefill during traffic spikes [2]
  • Select optimal tensor parallelism strategies (e.g., EP4DP16 for prefill, EP64DP3 for decode) [2]
  • Leverage the NVIDIA Inference Transfer Library (NIXL) for low-latency data movement between heterogeneous devices [2].

By combining hardware-specific strengths with adaptive resource management, developers can achieve 2–30× throughput improvements while maintaining strict latency targets [2][1].
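
To make the strategies above concrete, here is a hypothetical sketch of an operation-type router. The op categories, device choices, and batch-size rule mirror sections 1–3; none of this is a WebNN, ORT, or Dynamo API.

```js
// Hypothetical graph partitioner mirroring strategies 1–3 above.
// Op names and categories are illustrative only.
const DEVICE_BY_OP = {
  matmul: 'gpu', // compute-bound: large matrix multiplies (prefill)
  matvec: 'npu', // memory-bound: matrix-vector products (decode)
  dot: 'cpu',    // low-complexity, non-parallel ops
};

function assignDevices(graphNodes, { batchSize = 1 } = {}) {
  return graphNodes.map((node) => {
    let device = DEVICE_BY_OP[node.op] ?? 'cpu';
    // Strategy 3: large batches shift memory-bound ops back to the GPU.
    if (device === 'npu' && batchSize > 32) device = 'gpu';
    return { ...node, device };
  });
}

// Example: a toy three-node graph at batch size 64.
console.log(assignDevices(
  [{ op: 'matmul' }, { op: 'matvec' }, { op: 'dot' }],
  { batchSize: 64 },
));
// → matmul on gpu, matvec on gpu (large batch), dot on cpu
```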


Footnotes

  1. https://arxiv.org/html/2409.14803v1
  2. https://developer.nvidia.com/blog/introducing-nvidia-dynamo-a-low-latency-distributed-inference-framework-for-scaling-reasoning-ai-models/
  3. https://arxiv.org/html/2504.09775v1
  4. https://www.amd.com/en/developer/resources/technical-articles/model-pipelining-on-npu-and-gpu-using-ryzen-ai-software.html
  5. https://codemia.io/blog/path/Inference-at-the-Edge-How-the-Shift-Away-from-Data-Center-AI-Will-Reshape-System-Design
  6. https://github.com/gpustack/gpustack
  7. https://www.reddit.com/r/LocalLLaMA/comments/13caqcd/i_put_together_plans_for_an_absolute_budget_pc/
  8. https://www.scaleway.com/en/blog/distributed-ml-model-inference/
  9. https://www.linkedin.com/pulse/comparing-cpus-gpus-npus-ai-workloads-cybersecurity-serafin-iii-jaqve
  10. https://www.aiacceleratorinstitute.com/improving-ai-inference-performance-with-hardware-accelerators/

@zolkis
Collaborator

zolkis commented Jun 2, 2025

As for input to W3C best practices on explainer/spec generation (human-to-human specifications in general): I think we should make sure to digest all AI output through human synthesis (including context checking, extension, and validation), even if it is used just for code generation (machine - human - human - machine workflows). That is, we keep a human in the middle until we have sufficient machine abstracting power in the middle to ensure a given spec is a valid projection of the full context of the technical scope (as expected from engineers).

Collaborator

@zolkis zolkis left a comment


Overall good, but this is based primarily on the ONNX API only. I am not sure whether the coverage matches other frameworks well, or whether we should first work more on distilling common (as in intersection) use cases. However, I think we could merge this and refine it later.

This commit incorporates feedback from PR webmachinelearning#855:

- Adds a citation to ONNX Runtime's OrtExecutionProviderDevicePolicy
  as an informing API for the device preference list.
- Refines the usage of "other" in use case descriptions for clarity:
    - Removes quotes from "other" in "Maximize Performance".
    - Removes redundant ("other") clarification for NPU in
      "Maximize Power Efficiency".
@anssiko anssiko merged commit 61d4d5a into webmachinelearning:main Jun 2, 2025
2 checks passed
@anssiko
Member Author

anssiko commented Jun 2, 2025

Thank you @zolkis and our agents Jules and Copilot for your contributions to this experiment.

@anssiko anssiko deleted the device-selection-use-cases branch June 2, 2025 19:03
zolkis pushed a commit to zolkis/webnn that referenced this pull request Jun 17, 2025
This commit comprehensively updates the device selection explainer to reflect the latest discussions, API changes, and community feedback.

Key changes include:
- Updated Introduction and History sections to accurately reflect the removal of `MLDeviceType` from `MLContextOptions` (following PR webmachinelearning#809) and the shift towards hint-based, implementation-led device selection.
- Clarified the impact of key issues (webmachinelearning#749, webmachinelearning#302, webmachinelearning#350) and PRs (webmachinelearning#809, webmachinelearning#824, webmachinelearning#855) on the device selection strategy.
- Ensured the "Key use cases and requirements" section aligns with the current API, incorporating the device preference use cases from PR webmachinelearning#855.
- Updated JavaScript examples in "Scenarios, examples, design discussion" to be consistent with the current API, marking future/hypothetical features (like `opSupportLimitsPerDevice()` and a `fallback` option) with explanatory notes.
- Added new open questions based on recent discussions (e.g., issue webmachinelearning#836, PR webmachinelearning#854 regarding querying actual device usage).
- Refined the "Background thoughts" section, particularly the "Example Hardware Selection Guide," adding an editor's note about ongoing discussions (PR webmachinelearning#860).
- Corrected the "Considered alternatives" and "Minimum Viable Solution" sections to accurately represent the current and past approaches.
- Updated the "Next Phase Device Selection Solution" to clarify the status of proposals like `querySupport` (issue webmachinelearning#815) and the investigation of `graph.devices` (issue webmachinelearning#836, PR webmachinelearning#854).
- Performed a full proofread, correcting grammar, typos, and markdown formatting for improved clarity and consistency throughout the document.
anssiko pushed a commit that referenced this pull request Jun 27, 2025
This commit comprehensively updates the device selection explainer to reflect the latest discussions, API changes, and community feedback.

Key changes include:
- Updated Introduction and History sections to accurately reflect the removal of `MLDeviceType` from `MLContextOptions` (following PR #809) and the shift towards hint-based, implementation-led device selection.
- Clarified the impact of key issues (#749, #302, #350) and PRs (#809, #824, #855) on the device selection strategy.
- Ensured the "Key use cases and requirements" section aligns with the current API, incorporating the device preference use cases from PR #855.
- Updated JavaScript examples in "Scenarios, examples, design discussion" to be consistent with the current API, marking future/hypothetical features (like `opSupportLimitsPerDevice()` and a `fallback` option) with explanatory notes.
- Added new open questions based on recent discussions (e.g., issue #836, PR #854 regarding querying actual device usage).
- Refined the "Background thoughts" section, particularly the "Example Hardware Selection Guide," adding an editor's note about ongoing discussions (PR #860).
- Corrected the "Considered alternatives" and "Minimum Viable Solution" sections to accurately represent the current and past approaches.
- Updated the "Next Phase Device Selection Solution" to clarify the status of proposals like `querySupport` (issue #815) and the investigation of `graph.devices` (issue #836, PR #854).
- Performed a full proofread, correcting grammar, typos, and markdown formatting for improved clarity and consistency throughout the document.

Signed-off-by: Zoltan Kis <[email protected]>
With thanks to Jules for proofreading:
Co-authored-by: google-labs-jules[bot] <161369871+google-labs-jules[bot]@users.noreply.github.com>