design-reviews#626: WebGPU and WGSL

Opened Apr 27, 2021

Salut le TAG!

I'm requesting a TAG review of WebGPU and WGSL.

WebGPU is a proposed Web API to enable webpages to use the system's GPU (Graphics Processing Unit) to perform computations and draw complex images that can be presented inside the page. This goal is similar to that of the WebGL family of APIs, but WebGPU enables access to more advanced features of GPUs. Whereas WebGL is mostly for drawing images but can be repurposed (with great effort) to do other kinds of computations, WebGPU has first-class support for performing general computations on the GPU.

WGSL is the WebGPU Shading Language, WebGPU's companion programming language used to specify operations to execute on the GPU.

Explainer (minimally containing user needs and example code): https://gpuweb.github.io/gpuweb/explainer/
Specification URL: https://gpuweb.github.io/gpuweb/ and https://gpuweb.github.io/gpuweb/wgsl/
Tests: https://github.com/gpuweb/cts
Security and Privacy self-review: https://gpuweb.github.io/gpuweb/explainer/#questionnaire
GitHub repo (if you prefer feedback filed there): https://github.com/gpuweb/gpuweb
Primary contacts (and their relationship to the specification):
- Corentin Wallez (@kangz), Google (group co-chair)
- Kelsey Gilbert (@jdashg), Mozilla (group co-chair)
- Myles Maxfield (@litherum), Apple (WGSL editor)
- David Neto (@dneto0), Google (WGSL editor)
- Kai Ninomiya (@kainino0x), Google (WebGPU editor)
- Dzmitry Malyshau (@kvark), Mozilla (WebGPU editor)
Organization(s)/project(s) driving the specification: Apple, Google, Intel, Microsoft, Mozilla and more!
Key pieces of existing multi-stakeholder review or discussion of this specification: All of the Github repo and many meetings (see minutes)
External status/issue trackers for this specification (publicly visible, e.g. Chrome Status): Chrome Status, Firefox meta-bug

Further details:

I have reviewed the TAG's Web Platform Design Principles
Relevant time constraints or deadlines: Chromium is aiming for an Origin Trial around M93/94 (exact milestone TBD). The group is hoping to publish a recommendation around the end of the year.
The group where the work on this specification is currently being done: GPU for the Web CG
The group where standardization of this work is intended to be done: GPU for the Web WG
Major unresolved issues with or opposition to this specification:
This work is being funded by:

You should also know that...

There is a lot of engagement on the API even if it is only available behind flags in Firefox Nightly and Chromium Canary.
Native implementations of WebGPU (C/C++/Rust projections of the API), wgpu-rs and Dawn, are also generating a lot of excitement. Applications written against them can also target WebGPU through WASM for free (e.g. using Emscripten).
Most of the structural issues in WebGPU have been resolved but there are still details being fleshed out.
WGSL got started later and is still evolving somewhat in structural ways.
You can drop in the WebGPU matrix channel if you have small questions about the API or WGSL.
If you're interested in tutorials for the API, you can read the following:
- Get Started with GPU Compute on the Web for a first dive into WebGPU.
- The WebGPU samples are a great collection of small and not-so-small WebGPU samples.
- Learn wgpu tutorials for the 1-1 mapping of WebGPU to Rust. Note that these tutorials use GLSL instead of WGSL.

We'd prefer the TAG provide feedback as:

🐛 open issues in our GitHub repo for each point of feedback

Discussions

Comment by @Kangz May 5, 2021 (See Github)

Note that WebGPU's concept of invalid objects is where a bunch of its innovation budget is spent, and it differs from all other JavaScript APIs (apart from some similarities with WebGL), so I expect you'll want to take a detailed look at it.
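
As a rough illustration of what "invalid objects" means here, the following toy model (not the real API) shows the general shape: creation calls always return a handle and never throw, a handle that fails validation is internally marked invalid, invalidity propagates to objects built from it, and errors are collected for later inspection. The class `ToyDevice`, its methods, and the size rule are all hypothetical stand-ins chosen for this sketch.

```typescript
// Toy model only: every name here is hypothetical, not part of WebGPU.
interface Handle {
  label: string;
  valid: boolean;
}

class ToyDevice {
  private errors: string[] = [];
  private maxBufferSize: number;

  constructor(maxBufferSize: number) {
    this.maxBufferSize = maxBufferSize;
  }

  // Always returns a handle; a failed validation marks it invalid instead of throwing.
  createBuffer(label: string, size: number): Handle {
    const valid = size <= this.maxBufferSize;
    if (!valid) {
      this.errors.push(`${label}: size ${size} exceeds limit ${this.maxBufferSize}`);
    }
    return { label, valid };
  }

  // Invalidity is contagious: an object built from an invalid one is itself invalid.
  createBindGroup(label: string, buffer: Handle): Handle {
    const valid = buffer.valid;
    if (!valid) {
      this.errors.push(`${label}: uses invalid buffer ${buffer.label}`);
    }
    return { label, valid };
  }

  // Stand-in for asynchronous error reporting: drains the errors recorded so far.
  popErrors(): string[] {
    const drained = this.errors;
    this.errors = [];
    return drained;
  }
}
```

In this model the calling code never branches on a return value or catches an exception at creation time; it only learns about failures when it drains the recorded errors.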

Comment by @torgo Jun 2, 2021 (See Github)

Hi @Kangz thanks for bearing with us as we get a review going on this. It's a dense piece of work so any further steer you have on what you would like us to particularly look at would be great. Appreciate the thoroughness of the review request especially responses to the Security & Privacy questionnaire.

Comment by @kainino0x Jun 2, 2021 (See Github)

@torgo Of course, we understand it's a huge API :) Are you looking for pointers to content covered by the explainer, or to content in the spec that is not covered by the explainer?

The items discussed by the explainer are the ones that we thought would be of greatest interest, both in terms of design scrutiny and for understanding the API at a high level. It's hopefully organized in such a way that you can skip over details as needed to make it more digestible.

Of course there are a lot of aspects of the API that are not touched upon by the explainer, but for the most part these are highly domain-specific designs that are constrained by the shape of underlying native APIs, where we don't have that much design freedom. And there's a huge volume of these, so they are hard to summarize in an explainer.

Comment by @cynthia Jun 7, 2021 (See Github)

I haven't had enough time to dive deep into the spec, but here are some questions after giving it a quick first-pass.

Lots of strings in the API are USVString - feels like DOMString would be more appropriate, is there something I'm missing?
Given the power of the API, would it make sense to consider gating this behind a permission? I believe the counter argument is - why should this be behind a permission when WebGL is not, and I think WebGL was an unfortunate choice given how it is being abused to fingerprint users. (Also, I sense crypto miners would love to abuse this...)
Should the cache be origin-gated? If the compilation is incredibly fast this shouldn't be an issue, but with slow enough compilation times timing-based fingerprinting feels like it would be possible.
Is there a story for mitigating denial-of-service by exhausting VRAM? (Technically, I guess it won't be a denial of service, but more of a "new tabs paint crazy slow" from an end-user perspective.)

Comment by @Kangz Jun 7, 2021 (See Github)

> Given the power of the API, would it make sense to consider gating this behind a permission? I believe the counter argument is - why should this be behind a permission when WebGL is not, and I think WebGL was an unfortunate choice given how it is being abused to fingerprint users. (Also, I sense crypto miners would love to abuse this...)

There's the WebGL argument, but also the fact that having permission-less access to at least the basic features of WebGPU is generally useful. More and more websites are showing 3D content, and it would add a lot of friction to ask for a permission just to show this content (use cases are online shops, journal/blog articles, ads, company websites, fancy backgrounds, etc.). With the structure of the API, the browser should have a lot of choice in how it exposes WebGPU depending on the context of the page (how much the user interacts with it, iframe vs not, etc.).

I'm not an expert on fingerprinting but AFAIK WebGL fingerprinting (outside the renderer string) is somewhat equivalent to the 2D-canvas fingerprinting. WebGPU should be the same. Likewise crypto-mining abuse is already available with JS/WASM/WebGL and WebGPU shouldn't make it that much more efficient than WebGL AFAIK so browsers should develop/are developing means to detect abuse.

> Should the cache be origin-gated? If the compilation is incredibly fast this shouldn't be an issue, but with slow enough compilation times timing-based fingerprinting feels like it would be possible.

These caches should be origin-gated, and are listed in https://github.com/privacycg/storage-partitioning

> Is there a story for mitigating denial-of-service by exhausting VRAM? (Technically, I guess it won't be a denial of service, but more of a "new tabs paint crazy slow" from an end-user perspective.)

Browsers could track VRAM usage in the GPU process, decide on quotas if needed, and "lose the device" if they need to reclaim quota. It's easy to cause a VRAM denial of service via other means too (like 2D canvas, but also many stacked GPU-accelerated DOM layers with things like will-change, etc.).

Comment by @kainino0x Jun 8, 2021 (See Github)

> Lots of strings in the API are USVString - feels like DOMString would be more appropriate, is there something I'm missing?

Good question. This was a deliberate decision in the places where we used it:

All of the debug strings (GPUObjectBase/label, GPUObjectDescriptorBase/label, pushDebugGroup/groupLabel, insertDebugMarker/markerLabel) may optionally be passed down to the GPU driver to enhance the debugging experience for developers running specialized GPU debugging tools against the browser. Implementations which choose to do this would presumably perform some further sanitization on these strings before passing them, but nonetheless invalid Unicode strings don't make sense: at least some drivers would take UTF-8 (or maybe ASCII).
Shader code (GPUShaderModuleDescriptor/code, GPUProgrammableStage/entryPoint) is processed as a Unicode string by WGSL, so a restriction is needed here as well. In practice, implementations are likely to do that processing in UTF-8, and this also enables that. Once again, however, there is still a requirement to handle certain other edge cases (like \0, filed as https://github.com/gpuweb/gpuweb/issues/1816).
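
To make the distinction concrete: a DOMString may contain unpaired surrogate code units, whereas converting to USVString replaces each unpaired surrogate with U+FFFD, so the result can always be encoded as UTF-8. The sketch below shows that conversion in TypeScript. The helper name `toUSVString` is hypothetical; browsers perform this step inside their Web IDL bindings rather than exposing such a function here.

```typescript
// Hypothetical helper mimicking the Web IDL DOMString -> USVString conversion:
// every unpaired surrogate code unit is replaced with U+FFFD.
function toUSVString(input: string): string {
  let out = "";
  for (let i = 0; i < input.length; i++) {
    const unit = input.charCodeAt(i);
    const isHigh = unit >= 0xd800 && unit <= 0xdbff;
    const isLow = unit >= 0xdc00 && unit <= 0xdfff;
    if (isHigh && i + 1 < input.length) {
      const next = input.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        // Well-formed surrogate pair: copy both code units unchanged.
        out += input.charAt(i) + input.charAt(i + 1);
        i++;
        continue;
      }
    }
    // Lone surrogates become U+FFFD; everything else is copied as-is.
    out += isHigh || isLow ? "\ufffd" : input.charAt(i);
  }
  return out;
}
```

With this behaviour, a label or shader source containing a stray surrogate reaches the implementation as a well-formed string instead of failing later during UTF-8 encoding.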

Comment by @kainino0x Jun 9, 2021 (See Github)

It was discussed in https://github.com/gpuweb/gpuweb/issues/784

Comment by @kainino0x Jun 10, 2021 (See Github)

> It's a dense piece of work so any further steer you have on what you would like us to particularly look at would be great.

We talked about this further and here are the explainer sections that I think you should focus on:

Introduction
Additional Background
Adapters and Devices intro section
Optional Capabilities: important because it exposes hardware differences.
Object Validity and Destroyed-ness: important because it introduces a new error mechanism to the platform.
Errors intro section, Problems and Solutions intro section, rest as interested.
Buffer Mapping intro section, rest as interested.
Multithreading: eventually important, but not urgent because not part of MVP.
Image, Video, and Canvas input, Canvas Output: if interested in these topics.
Bitflags: optional, a small peculiarity in the JS API.
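
For readers unfamiliar with the Bitflags item: WebGPU expresses sets of usages as numeric bitmasks combined with `|`, rather than as arrays or dictionaries of strings. The sketch below mirrors a few `GPUBufferUsage` values as local constants so that it runs without a GPU. The numeric values are believed to match the specification but should be checked against it, and the `hasUsage` helper is hypothetical, not part of the API.

```typescript
// Local stand-ins for a few GPUBufferUsage flags (values assumed to match the spec).
const BufferUsage = {
  MAP_READ: 0x0001,
  MAP_WRITE: 0x0002,
  COPY_SRC: 0x0004,
  COPY_DST: 0x0008,
  STORAGE: 0x0080,
} as const;

// Usage sets are plain numbers; individual flags are combined with bitwise OR.
type BufferUsageFlags = number;

// Hypothetical helper: true when every bit of `flag` is set in `flags`.
function hasUsage(flags: BufferUsageFlags, flag: number): boolean {
  return (flags & flag) === flag;
}

// A readback (staging) buffer: a copy destination that can be mapped for reading.
const stagingUsage: BufferUsageFlags = BufferUsage.MAP_READ | BufferUsage.COPY_DST;
```

In the real API the combined value would be passed as the `usage` member of a buffer descriptor, which is the peculiarity the explainer section discusses.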

I'm not specifically aware of any important items that are in the spec but not in the explainer. (If I were, I'd have written an explainer section on them :) However, I could be forgetting something.)

Comment by @Kangz Sep 2, 2021 (See Github)

Hey all, any progress on the review for WebGPU and WGSL? We are making significant progress in the group towards a first version of WebGPU and would like to incorporate feedback from the TAG early if possible.

Comment by @beaufortfrancois Sep 2, 2021 (See Github)

For info, we've announced an origin trial in Chrome 94. See https://twitter.com/ChromiumDev/status/1432257883912216577 and https://web.dev/gpu/

Comment by @domenic Sep 3, 2021 (See Github)

A particular issue here where the specification is departing from web API design best practices is https://github.com/gpuweb/gpuweb/issues/244 ; that might be worth looking into.

Discussed Sep 20, 2021 (See Github)

Sangwhan: will block out some time, it's a massive spec

Ken: would be nice if the people who created Babylon.js or three.js gave feedback

Discussed Sep 27, 2021 (See Github)

Sangwhan: do we have a compute pressure api equivalent for gpus?

Ken: no they don't have that telemetry yet

Sangwhan: risk that a WebGPU kernel could choke the GPU to a point where [???]. A WG being chartered

Dan: from process and multi stakeholder perspective this looks good

Sangwhan: disagreement about what to use as shading language seems to have been resolved. SPIR-V and HLSL

Ken: ended up doing their own. Now there's WSL or something. I hope this new one is more webby.

Dan: concerned about fingerprinting and any kind of cross processor vulnerabilities

Sangwhan: a few ways to leak information through WebGL, and WebGL is a massive fingerprinting surface. This probably opens up more holes. Should this be permission gated? It's incredibly powerful. The memory model part is a concern for me. The fact that GPU memory protection is a bit less .. not as battle tested as CPU memory. Hasn't been abused as much. There are plenty of cases where a crappy GPU driver can leak information through the VRAM

Dan: let's ask these questions

Sangwhan: nothing they can do about it if the gpu driver is bad. In that case should we even do this work, but there's a huge user need / industry need.

Dan: we should be encouraging them to document risks and come up with mitigations

Sangwhan: I think they have a section on this

Dan: S&P review is not talking about fingerprinting at all.

Sangwhan: they do. Also DoS. Thing about memory security.

Dan: don't have specific mitigations

Sangwhan: sort of impossible because browser level implementers don't have control over memory controller

Dan: yes and one can find oneself on a webpage that invokes this api by following a malicious link, so it's much easier to get into this situation when you are on the web so just saying it's no different than a native app - people usually make an affirmative choice to install an app, but you don't necessarily have that choice when you're talking about the web, hence safe to browse, blah blah blah. We need to put pressure about mitigations and potentially a permissions step

Sangwhan: permissions definitely something I'd bring up. Mitigations are tricky - drivers are controlled by AMD, nvidia, Intel, ...

Hadley: worth talking to them? Might be the way the industry is going, not just the web.

Sangwhan: a lot of work has been done on the CPU. I'll try to leave feedback. It's tricky. GPU driver stuff is completely proprietary. Different drivers/gpus have different issues. Complexity of the problem increases. I'll draft a comment.

Dan: I might paste in the safe to browse design principle and reinforce this point.

Comment by @Kangz Nov 16, 2021 (See Github)

Hey all, any progress on this review? I know it's a big chunk of work, but all browsers are actively implementing the spec now and looking forward to shipping, so it would be nice to make changes based on TAG feedback sooner rather than later.

Comment by @kenchris Dec 7, 2021 (See Github)

This is a very nice explainer and it is quite clear that you have put a lot of effort into this and it really shows! @cynthia and I discussed in our breakout that as this has more raw access to the GPU, what is the mitigation in place to make sure this doesn't negatively affect other GPU intensive apps, like the browser itself or the host OS?

If you have a misbehaving web app, how is this handled? For CPU in Chrome today there is infinite loop detection and the tab can be closed, but GPU can affect the whole system and not just a single tab.

Comment by @cynthia Dec 7, 2021 (See Github)

I have some follow-up comments and questions on this which I will post before our VF2F is over, but overall we are happy with what is being proposed.

Comment by @Kangz Dec 7, 2021 (See Github)

> I have some follow-up comments and questions on this which I will post before our VF2F is over, but overall we are happy with what is being proposed.

Thank you, looking forward to your comments!

> what is the mitigation in place to make sure this doesn't negatively affect other GPU intensive apps, like the browser itself or the host OS?

Preemption of GPU workloads is not a ubiquitous feature, and even GPUs that support preemption have it at different levels of granularity. So it's impossible to completely prevent denial of service of the GPU. However, OSes have mechanisms to reset the GPU if it is unresponsive for a while, like Windows' TDR mechanism. From an application's point of view, it "loses" the GPU. While a rare event, it is not exceptional, and other applications should handle it correctly (external GPUs can be unplugged, for example).

Browsers can also detect when the GPU is occupied for too long. Chromium has a "watchdog" mechanism that triggers when GPU execution takes too long, and will restart the GPU process in that case as an attempt to free resources. It will also give a "strike" to the responsible origin. After some strikes the origin can be blocklisted from using the GPU.
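
The strike bookkeeping described above could look roughly like the following sketch. It is not Chromium's implementation: the class name `GpuStrikeRegistry`, its methods, and the configurable threshold are all hypothetical, and the comment above does not say how many strikes lead to blocklisting.

```typescript
// Hypothetical per-origin strike bookkeeping; names and threshold are assumptions.
class GpuStrikeRegistry {
  private strikes: Record<string, number> = {};
  private maxStrikes: number;

  constructor(maxStrikes: number) {
    this.maxStrikes = maxStrikes;
  }

  // Called after the watchdog has restarted the GPU process because of `origin`.
  // Returns true if the origin is blocklisted as a result.
  recordStrike(origin: string): boolean {
    this.strikes[origin] = (this.strikes[origin] ?? 0) + 1;
    return this.isBlocked(origin);
  }

  // An origin is blocked once it has accumulated at least `maxStrikes` strikes.
  isBlocked(origin: string): boolean {
    return (this.strikes[origin] ?? 0) >= this.maxStrikes;
  }
}
```

A user agent following this pattern would consult `isBlocked` before granting an origin a new device, so a repeatedly misbehaving page stops being able to hang the GPU for everyone else.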

Note that denial of service of the GPU is not something unique to WebGPU. It is trivial to do so with a while(true){} in a WebGL shader and also possible by creating massive GPU workloads with <canvas> or the DOM (with lots of overdraw of filtered layers for example).

Comment by @Kangz Jan 4, 2022 (See Github)

@cynthia you mentioned you'd post questions after the VF2F is over, do you have a list that the WebGPU group can look at?

Discussed Jan 17, 2022 (See Github)

Planning for how to do review so we can schedule it in for next week.

Added Max to issue

Comment by @Kangz Jan 28, 2022 (See Github)

Sorry to ping again, but this review has been open for 9 months already. TAG review is the most important of the horizontal reviews for WebGPU (PING comes close, others are mostly N/A). Implementations are well under way, so getting feedback soon would help reduce the risk of needing to make late API changes.

Discussed Feb 7, 2022 (See Github)

Max did some review, will coordinate with other assignees.

Comment by @Kangz Mar 21, 2022 (See Github)

Pinging again. This review has been open for 11 months now. We're getting ever closer to a shippable API and really need your review. We're supposed to wait for TAG review to move WebGPU to candidate recommendation but it can't be blocked on TAG review forever considering the lead time, reminders and detailed explainers.

So please, give us some feedback, or let's figure out a way to help you review WebGPU (via more explainers, walkthroughs with group members, etc).

Comment by @cynthia Mar 22, 2022 (See Github)

Sorry for the delayed turnaround. We had this scheduled for our F2F but looking at the minutes it looks like we didn't have enough time to look into that during the breakout where it was supposed to happen. Will discuss within the group and try to provide some closure on this before the F2F finishes.

Discussed Apr 11, 2022 (See Github)

Dan: closing comment?

Sangwhan: questions and a closing comment. Will share questions. API design itself looks fine but there should be some safety nets. If you're on a single GPU system you can use any amount of GPU resources on the tab. Something you can do with WebGL but that's a bug not a feature.

.. My comments:

Higher granularity for selecting devices? For example, for cases where one wants to select a high-VRAM device (e.g. for machine learning workloads) or the non-dominant (e.g. secondary device which hopefully will not jank the current compositor) device.
Power consumption - while there is a mechanism to allow applications to request a high power vs low power device, wondering if the choice should be initiated by the user, and not the application
e.g. if the user is on the go, let them choose integrated GPU instead of discrete to save power
Revisiting position on permissions based on this data - and some informal asking around non-technical folks. Graphics is too generic and users will likely be confused - e.g. “yes, I guess the website needs graphics” when it is actually trying to mine Ethereum.
Instead of a permission, should this be something else..?
e.g. big fat warning suggesting “this will use lots of power”
Re: USVString vs DOMString - saw discussion LGTM.

Dan: proposed close?

Sangwhan: yes [will leave comment]

Discussed Apr 11, 2022 (See Github)

Dan: we are late on this one

Max: a couple of weeks ago Sangwhan and I had a discussion, and had a summary with further questions to ask, but didn't see Sangwhan send the questions. Probably we can discuss in plenary? My suggestion is to send the further questions.

Dan: yes. Do you have the questions? You can post them

Max: we wrote them in a google doc, I can't access it

Dan: let's see if we can get him to post those.. they're waiting. Left a note for Sangwhan.

Discussed Apr 18, 2022 (See Github)

Max: Sangwhan had questions, hasn't put them on github yet. Had a discussion and he summarised further questions to ask

Amy: did he put them in slack?

Max: ah yeah.. about the hierarchy of selecting devices and power consumption concerns and information ..

Amy: you can post them?

Max: we can wait for Sangwhan.. I'll remind

Sangwhan: [appears] I will do that

Comment by @Kangz Apr 27, 2022 (See Github)

It is now the one year birthday of this request for a TAG review, and there is still no official feedback that has been provided.

Some TAG members did look at the API and provided feedback out of band, but it doesn't replace the official review.
WebGPU is a large and complex API, but the CG/WG members provided detailed explainers and have always been available to write more or answer questions directly.
We were told the TAG review queue could be months long and that the group should be patient, but it's been a year, with multiple reminders after the 6th month, and WebGPU is still getting moved from milestone to milestone.

We're trying to finish the first version of the WebGPU API and go to CR. We're supposed to wait for horizontal reviews but blocking indefinitely on the TAG is not okay. If feedback from the TAG comes before CR then we'll try to address it, but if it doesn't, then we'll have to do without.

My suggestion for handling large APIs like WebGPU in the TAG is to ask the groups which parts of the API are most "at risk" of receiving feedback and do partial reviews for each of these parts. Trying to make one full review for the full API can be too high a commitment for TAG members, and might not be a good use of their time. Especially when some details (in the case of WebGPU) are purely technical and internal, without any interaction with the user or the rest of the Web platform.

Comment by @Kangz May 20, 2022 (See Github)

Ping again :)

Comment by @cynthia May 23, 2022 (See Github)

Apologies for the delay. The review itself is complete on our end, but got delayed due to other internal discussions. The TL;DR summary is that while we have not been able to deeply investigate the detailed utility/risks due to the size of the spec, the overall impression is that we think this work shows a lot of promise and is a definite improvement over the status quo (WebGL).

Here are some comments that came up during the review:

Should there be a higher granularity for selecting devices? For example, for cases where one wants to select a high-VRAM device (e.g. for machine learning workloads) or the non-dominant (e.g. secondary device which hopefully will not jank the current compositor) device. Allowing a single application to be the dominant resource user is not new, but it is definitely something that we want to be careful about.
Power consumption is a concern. While there is a mechanism to allow applications to request a high power vs low power device, we are wondering if the choice should be initiated by the user, and not the application. e.g. if the user is on the go, let them choose integrated GPU instead of discrete to save power. Initially, we discussed the possibility of forcing the application to integrated in low-power scenarios, but that adds another fingerprinting surface so we don't think this is a valid suggestion.
It might be useful to have a note suggesting implementations should surface a warning about power usage at some point, given that this would be permission free.
We have since revisited our position on permissions based on this data. Based on some informal asking around non-technical folks, the term "graphics" is too generic and users will likely be confused - e.g. “yes, I guess the website needs graphics” when it is actually trying to mine Ethereum. For these reasons, gating behind a permission might not be adequate; but it might be useful to add a note (implementation detail) which allows users to revoke access (or kill it) if an application is hogging the GPU. (this would be analogous to the "this tab is hanging" dialog)
Re: USVString vs DOMString - we are happy with the group's conclusion.
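
To illustrate the power-consumption point in the comments above: the page's `powerPreference` hint (`"low-power"` or `"high-performance"`, passed when requesting an adapter) could be combined with an explicit user choice, with the user's choice taking precedence. The sketch below models only that decision as a pure function. The `UserGpuPolicy` type and `resolvePowerPreference` function are hypothetical and do not correspond to anything in the spec or in a shipping browser.

```typescript
// The two hint values a page can pass when requesting an adapter.
type PowerPreference = "low-power" | "high-performance";

// Hypothetical user-facing setting: defer to the page, or always prefer the integrated GPU.
type UserGpuPolicy = "follow-page" | "prefer-integrated";

// Hypothetical resolution step: an explicit user choice wins over the page's hint.
function resolvePowerPreference(
  requested: PowerPreference | undefined,
  userPolicy: UserGpuPolicy
): PowerPreference | undefined {
  if (userPolicy === "prefer-integrated") {
    return "low-power";
  }
  // No user override: pass the page's hint through (undefined means no preference).
  return requested;
}
```

Because the override here comes from a setting the user chose, rather than from device state such as battery level, it is closer to the user-initiated choice suggested above than to the automatic forcing that was rejected on fingerprinting grounds.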

Comment by @cynthia May 23, 2022 (See Github)

There is one meta-question that remains to be answered here: how can we do better with large specs like this? The current process is very optimized towards small-ish, self-contained specs that can be reviewed in a single day. This definitely wasn't one of them, and we had multiple F2F meetings where we did not have enough time in a single sitting, causing it to be rescheduled multiple times. We'll discuss this internally and see if we can make a proposal to improve this going forward.

Comment by @Kangz May 23, 2022 (See Github)

Thank you for posting this! I'm glad that the TAG is overall satisfied with the current spec. Copied this feedback in a new issue (https://github.com/gpuweb/gpuweb/issues/2927) so that the WebGPU group can discuss it and come back to the TAG.

Comment by @cynthia May 25, 2022 (See Github)

Thanks for bringing the feedback to the group, and apologies this took so long.