
Conversation

@anssiko
Member

@anssiko anssiko commented Jun 2, 2025

Adds a new subsection "Device Preference Use Cases" to the device-selection-explainer.md document.

This subsection details several use cases for device selection preferences, mapping them to the preferences discussed in the W3C WebML WG minutes of 2025-05-08 (https://www.w3.org/2025/05/08-webmachinelearning-minutes.html#2ec0).

The use cases cover:

  • Preferring CPU
  • Preferring NPU
  • Preferring GPU
  • Maximizing performance
  • Maximizing power efficiency
  • Minimizing overall system power

Future-proof device names ("where JS and Wasm execute", "where WebGL and WebGPU programs execute", "other") are used in the descriptions.
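
As a rough illustration of how the last two use cases map onto today's API surface, here is a minimal sketch using the existing `MLContextOptions.powerPreference` hint. The per-device preferences (prefer CPU/NPU/GPU) discussed in this PR have no standardized API surface yet; only the power hints below are specified.

```js
// Minimal sketch using the existing MLContextOptions.powerPreference hint.
// The per-device preferences (CPU/NPU/GPU) have no standardized API yet.
async function demoContexts() {
  // "Maximize performance": let the implementation pick the fastest device.
  const perfContext = await navigator.ml.createContext({
    powerPreference: 'high-performance',
  });

  // "Maximize power efficiency": favor sustained low power over peak speed.
  const lowPowerContext = await navigator.ml.createContext({
    powerPreference: 'low-power',
  });

  return { perfContext, lowPowerContext };
}
```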

@anssiko anssiko requested review from Copilot and zolkis June 2, 2025 12:54

Copilot AI left a comment


Pull Request Overview

This PR adds a new subsection titled "Device Preference Use Cases" to the device-selection-explainer.md document to clearly outline different device selection strategies for WebNN applications.

  • Adds detailed use cases for preferring CPU, NPU, GPU, maximizing performance, maximizing power efficiency, and minimizing overall system power.
  • Provides example preference strings and descriptions mapped to W3C WebML WG minutes.

Updates the description for the "Prefer execution on a Neural
Processing Unit (NPU)" use case in device-selection-explainer.md.

The term "other" as a future-proof device category is now
explicitly defined as encompassing specialized ML accelerators
that are distinct from CPUs and GPUs.

This addresses review feedback to improve clarity regarding the NPU use case and its relation to the "other" device category.
@anssiko
Member Author

anssiko commented Jun 2, 2025

N.B.: This is an experiment on how to translate the group's meeting discussion into an explainer update using coding agents working together. Any explainer content is non-normative and thus makes for a good testing ground. Also, in my experience, none of the agents understand all the Bikeshed syntax nuances yet but are good with Markdown.

That said, we require the usual human reviews for any changes, including for non-normative content.

Cc @dontcallmedom to share any W3C-level best practices on the topic.

@zolkis
Collaborator

zolkis commented Jun 2, 2025

Playing along with AI-generated markup, we could include/merge these considerations as well :).

Hardware Selection Guide

| Factor | CPU | GPU | NPU |
| --- | --- | --- | --- |
| Best For | Sequential logic, small models | Parallel training, large batches | Edge inference, low-power AI |
| Power Efficiency | Moderate | High consumption | Ultra-efficient |
| Latency | 50–100 ms | 10–30 ms | 2–10 ms |
| Typical Use | Preprocessing, decision trees | LLM training, computer vision | Smartphones, IoT devices |

Key Decision Criteria

  • Throughput needs:
    • GPUs handle >10k queries/sec, while NPUs typically manage 1–5k queries/sec.
  • Model complexity:
    • NPUs are optimized for transformer layers; GPUs excel at CNN/RNN workloads.
  • Deployment environment:
    • NPUs dominate mobile and edge devices; GPUs are standard in cloud and data center environments, but usable in client environments as well.

Modern systems often combine all three:

  • CPUs for input handling,
  • GPUs for model execution,
  • NPUs for post-processing, balancing performance and efficiency across the pipeline (a toy sketch follows below).
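
A toy JavaScript heuristic distilling the guide above; the thresholds are lifted from the (illustrative) table and decision criteria, not from any shipping API.

```js
// Toy device-selection heuristic based on the table and criteria above.
// Thresholds are illustrative; real systems would profile, not hardcode.
function pickDevice({ queriesPerSec = 0, batchSize = 1, powerConstrained = false }) {
  if (powerConstrained) return 'npu';       // ultra-efficient, 2–10 ms latency
  if (queriesPerSec > 10_000) return 'gpu'; // GPUs handle >10k queries/sec
  if (batchSize <= 8) return 'npu';         // consistent low latency at small batches
  if (batchSize > 32) return 'gpu';         // throughput scales with parallel compute
  return 'cpu';                             // sequential logic, small models
}

console.log(pickDevice({ powerConstrained: true }));               // 'npu'
console.log(pickDevice({ queriesPerSec: 20_000, batchSize: 64 })); // 'gpu'
```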

@zolkis
Collaborator

zolkis commented Jun 2, 2025

N.B. I am working on this as well right now, albeit using a human mind (for synthesis), and also adding to the historical sections.
Maybe we could merge this and I'll take a human second pass on these sections.

@zolkis
Collaborator

zolkis commented Jun 2, 2025

The md excerpt above was generated using Perplexity's agentic reasoning, taking into account about 20 sources, e.g.:
https://www.ibm.com/think/topics/npu-vs-gpu
https://collabnix.com/is-npu-better-than-gpu/
https://www.corsair.com/us/en/explorer/diy-builder/power-supply-units/cpu-vs-gpu-vs-npu-whats-the-difference/
https://gcore.com/learning/what-is-ai-inference
https://blog.squeezebits.com/36821
https://www.anyscale.com/blog/announcing-compiled-graphs
https://docs.vllm.ai/en/v0.8.2/models/extensions/runai_model_streamer.html
https://www.linkedin.com/posts/joseph-ede_how-to-load-information-in-artificial-intelligence-activity-7130211022090428417-YDZk/
and others, but the context still looks quite limited to me.

When using AI tools, it would be nice to share how to reproduce the output and which sources/context were taken into account.
For instance, as a next (fun) step, I used Perplexity with this prompt:
https://www.perplexity.ai/search/explain-ai-model-loading-graph-5F42DrfqSJyVCI76Y8RjNA?0=d
I'm putting the output in the next comment.

@zolkis
Collaborator

zolkis commented Jun 2, 2025

> During local inference (graph compilation) of various AI models, summarize strategies on how to distribute compute nodes (operations) among GPU, NPU, or CPU.

When distributing compute nodes across GPU, NPU, and CPU during AI model inference, optimal strategies depend on operation type, model architecture, and system constraints. Below are key approaches based on performance characteristics and hardware capabilities:

1. Operation-Type Optimization

  • Matrix multiplication (compute-bound):
    Use GPUs for large matrix operations (e.g., transformer prefill stages). GPUs achieve 22.6% lower latency and 2× higher throughput than NPUs for these tasks due to parallel compute units [1].
    • Example: Serving Llama 70B with TensorRT-LLM on NVIDIA Hopper GPUs [2].
  • Matrix-vector multiplication (memory-bound):
    Deploy NPUs, which reduce latency by 58.5% compared to GPUs. NPUs leverage DMA for efficient memory access, ideal for LLM decode phases [1][3].
    • Example: NPUs process TinyLlama inference 3.2× faster than GPUs [1].
  • Low-complexity operations (e.g., dot product):
    Assign to CPUs, which avoid GPU/NPU memory overhead and achieve lower latency for non-parallel tasks [1].

2. Model Architecture Considerations

  • Large Language Models (LLMs):
    • Prefill: GPU clusters (compute-heavy)
    • Decode: NPUs (memory-bound, sequential token generation) [1][3]
    • Use disaggregated serving (e.g., NVIDIA Dynamo) to split phases across devices, boosting throughput by 30× on GB200 NVL72 [2].
  • LSTM/RNN Models:
    Prefer GPUs, which outperform NPUs by 2.7× due to irregular memory access patterns [1].
  • Vision Models (e.g., MobileNetV2):
    • Small batches: NPUs (consistent latency)
    • Large batches: GPUs (scaling throughput) [1]

3. Batch Size and Latency Tradeoffs

| Scenario | Preferred Hardware | Rationale |
| --- | --- | --- |
| Small batch (1–8) | NPU | 3× lower latency for video classification [1] |
| Large batch (>32) | GPU | Throughput scales with parallel compute [1] |
| Real-time SLOs | NPU + CPU | NPU for decode, CPU for lightweight ops [3] |

4. Power-Constrained Deployments

  • NPUs consume ≤50% of the power of GPUs for equivalent performance, making them ideal for edge devices [1].
  • Use CPU/NPU hybrids for latency-sensitive applications requiring energy efficiency.

5. Dynamic Orchestration

Tools like NVIDIA Dynamo Planner monitor GPU/NPU utilization and automatically:

  • Shift decode GPUs to prefill during traffic spikes [2]
  • Select optimal tensor parallelism strategies (e.g., EP4DP16 for prefill, EP64DP3 for decode) [2]
  • Leverage the NVIDIA Inference Transfer Library (NIXL) for low-latency data movement between heterogeneous devices [2].

By combining hardware-specific strengths with adaptive resource management, developers can achieve 2–30× throughput improvements while maintaining strict latency targets [2][1].
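
To make the strategies above concrete, here is a hypothetical sketch of an operation-type router. The op categories, device choices, and batch-size rule mirror sections 1–3; none of this is a WebNN, ORT, or Dynamo API.

```js
// Hypothetical graph partitioner mirroring strategies 1–3 above.
// Op names and categories are illustrative only.
const DEVICE_BY_OP = {
  matmul: 'gpu', // compute-bound: large matrix multiplies (prefill)
  matvec: 'npu', // memory-bound: matrix-vector products (decode)
  dot: 'cpu',    // low-complexity, non-parallel ops
};

function assignDevices(graphNodes, { batchSize = 1 } = {}) {
  return graphNodes.map((node) => {
    let device = DEVICE_BY_OP[node.op] ?? 'cpu';
    // Strategy 3: large batches shift memory-bound ops back to the GPU.
    if (device === 'npu' && batchSize > 32) device = 'gpu';
    return { ...node, device };
  });
}

// Example: a toy three-node graph at batch size 64.
console.log(assignDevices(
  [{ op: 'matmul' }, { op: 'matvec' }, { op: 'dot' }],
  { batchSize: 64 },
));
// → matmul on gpu, matvec on gpu (large batch), dot on cpu
```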


Footnotes

  1. https://arxiv.org/html/2409.14803v1
  2. https://developer.nvidia.com/blog/introducing-nvidia-dynamo-a-low-latency-distributed-inference-framework-for-scaling-reasoning-ai-models/
  3. https://arxiv.org/html/2504.09775v1
  4. https://www.amd.com/en/developer/resources/technical-articles/model-pipelining-on-npu-and-gpu-using-ryzen-ai-software.html
  5. https://codemia.io/blog/path/Inference-at-the-Edge-How-the-Shift-Away-from-Data-Center-AI-Will-Reshape-System-Design
  6. https://github.com/gpustack/gpustack
  7. https://www.reddit.com/r/LocalLLaMA/comments/13caqcd/i_put_together_plans_for_an_absolute_budget_pc/
  8. https://www.scaleway.com/en/blog/distributed-ml-model-inference/
  9. https://www.linkedin.com/pulse/comparing-cpus-gpus-npus-ai-workloads-cybersecurity-serafin-iii-jaqve
  10. https://www.aiacceleratorinstitute.com/improving-ai-inference-performance-with-hardware-accelerators/

@zolkis
Collaborator

zolkis commented Jun 2, 2025

As for input to W3C best practices on explainer/spec generation (human-to-human specifications in general): I think we should make sure to digest all AI output through human synthesis (including context checking, extension, and validation), even if it is used just for code generation (machine - human - human - machine workflows). That is, we keep a human in the middle until we have sufficient machine abstracting power in the middle to ensure a given spec is a valid projection of the full context of the technical scope (as expected from engineers).

Collaborator

@zolkis zolkis left a comment


Overall good, but this is based primarily on the ONNX API only. I am not sure whether the coverage matches other frameworks well, or whether we should first work more on distilling common (as in intersection) use cases. However, I think we could merge this and refine it later.

This commit incorporates feedback from PR webmachinelearning#855:

- Adds a citation to ONNX Runtime's OrtExecutionProviderDevicePolicy
  as an informing API for the device preference list.
- Refines the usage of "other" in use case descriptions for clarity:
    - Removes quotes from "other" in "Maximize Performance".
    - Removes redundant ("other") clarification for NPU in
      "Maximize Power Efficiency".
@anssiko anssiko merged commit 61d4d5a into webmachinelearning:main Jun 2, 2025
2 checks passed
@anssiko
Member Author

anssiko commented Jun 2, 2025

Thank you @zolkis and our agents Jules and Copilot for your contributions to this experiment.

@anssiko anssiko deleted the device-selection-use-cases branch June 2, 2025 19:03
zolkis pushed a commit to zolkis/webnn that referenced this pull request Jun 17, 2025
This commit comprehensively updates the device selection explainer to reflect the latest discussions, API changes, and community feedback.

Key changes include:
- Updated Introduction and History sections to accurately reflect the removal of `MLDeviceType` from `MLContextOptions` (following PR webmachinelearning#809) and the shift towards hint-based, implementation-led device selection.
- Clarified the impact of key issues (webmachinelearning#749, webmachinelearning#302, webmachinelearning#350) and PRs (webmachinelearning#809, webmachinelearning#824, webmachinelearning#855) on the device selection strategy.
- Ensured the "Key use cases and requirements" section aligns with the current API, incorporating the device preference use cases from PR webmachinelearning#855.
- Updated JavaScript examples in "Scenarios, examples, design discussion" to be consistent with the current API, marking future/hypothetical features (like `opSupportLimitsPerDevice()` and a `fallback` option) with explanatory notes.
- Added new open questions based on recent discussions (e.g., issue webmachinelearning#836, PR webmachinelearning#854 regarding querying actual device usage).
- Refined the "Background thoughts" section, particularly the "Example Hardware Selection Guide," adding an editor's note about ongoing discussions (PR webmachinelearning#860).
- Corrected the "Considered alternatives" and "Minimum Viable Solution" sections to accurately represent the current and past approaches.
- Updated the "Next Phase Device Selection Solution" to clarify the status of proposals like `querySupport` (issue webmachinelearning#815) and the investigation of `graph.devices` (issue webmachinelearning#836, PR webmachinelearning#854).
- Performed a full proofread, correcting grammar, typos, and markdown formatting for improved clarity and consistency throughout the document.
anssiko pushed a commit that referenced this pull request Jun 27, 2025
This commit comprehensively updates the device selection explainer to reflect the latest discussions, API changes, and community feedback.

Key changes include:
- Updated Introduction and History sections to accurately reflect the removal of `MLDeviceType` from `MLContextOptions` (following PR #809) and the shift towards hint-based, implementation-led device selection.
- Clarified the impact of key issues (#749, #302, #350) and PRs (#809, #824, #855) on the device selection strategy.
- Ensured the "Key use cases and requirements" section aligns with the current API, incorporating the device preference use cases from PR #855.
- Updated JavaScript examples in "Scenarios, examples, design discussion" to be consistent with the current API, marking future/hypothetical features (like `opSupportLimitsPerDevice()` and a `fallback` option) with explanatory notes.
- Added new open questions based on recent discussions (e.g., issue #836, PR #854 regarding querying actual device usage).
- Refined the "Background thoughts" section, particularly the "Example Hardware Selection Guide," adding an editor's note about ongoing discussions (PR #860).
- Corrected the "Considered alternatives" and "Minimum Viable Solution" sections to accurately represent the current and past approaches.
- Updated the "Next Phase Device Selection Solution" to clarify the status of proposals like `querySupport` (issue #815) and the investigation of `graph.devices` (issue #836, PR #854).
- Performed a full proofread, correcting grammar, typos, and markdown formatting for improved clarity and consistency throughout the document.

Signed-off-by: Zoltan Kis <[email protected]>
With thanks to Jules for proofreading:
Co-authored-by: google-labs-jules[bot] <161369871+google-labs-jules[bot]@users.noreply.github.com>