design-reviews#843: Web Audio API: RenderCapacity API

#843: Web Audio API: RenderCapacity API

Opened May 10, 2023

I'm requesting a TAG review of RenderCapacity API.

Generally, the Web Audio renderer’s performance is affected by the machine speed and the computational load of an audio graph. However, Web Audio API does not expose a way to monitor the computational load, and it leaves developers no options to detect glitches that are the essential part of UX in audio applications. Providing developers with a “glitch indicator” is getting more important because the scale of audio applications grows larger and more complex. (Developers have been asking for this feature since 2018.)

Explainer: bit.ly/render-capacity-explainer (will be moved to a markdown doc soon)
Specification URL: https://webaudio.github.io/web-audio-api/#AudioRenderCapacity (Editor's Draft)
Tests: N/A
User research: N/A
Security and Privacy self-review²: See the Privacy Concern section in the explainer.
Primary contacts:
- Hongchan Choi (@hoch), W3C Audio Working Group Co-chair
- Paul Adenot (@padenot), Web Audio API specification editor
Organization(s)/project(s) driving the specification: W3C Audio Working Group
Key pieces of existing multi-stakeholder review or discussion of this specification: https://github.com/WebAudio/web-audio-api/issues/2444
External status/issue trackers for this specification (publicly visible, e.g. Chrome Status): https://chromestatus.com/feature/5180333360676864

Further details:

[v] I have reviewed the TAG's Web Platform Design Principles
Relevant time constraints or deadlines: 2023 Q2~Q3
The group where the work on this specification is currently being done: W3C Audio WG
The group where standardization of this work is intended to be done (if current group is a community group or other incubation venue): N/A
Major unresolved issues with or opposition to this specification: N/A
This work is being funded by: N/A

We'd prefer the TAG provide feedback as (please delete all but the desired option): 💬 leave review feedback as a comment in this issue and @hoch @padenot

Discussions

Discussed Jul 1, 2023 (See Github)

Overall looks good. Unsure what "load" means in this case; do they mean Unix loadavg? If so, that doesn't work on Windows. Tess commented to this effect.

Comment by @hober Aug 2, 2023 (See Github)

How are you defining "load" (as exposed in averageLoad and peakLoad? Is this "unix load average as reported by /usr/bin/w? Is there an equivalent concept on non-Unix platforms?

Comment by @padenot Aug 3, 2023 (See Github)

Audio systems, when rendering an audio stream, typically work with a synchronous audio callback (called in the spec a system level audio callback), that is called in an isochronous fashion, by the system, on a real-time thread, with a buffer of n frames, that the program must fill entirely, and the return as soon as possible.

This callback, called continuously during the lifetime of the audio stream, returns to the system with audio samples, and then hands it off to the rest of the OS. This audio might be post-processed and is usually output on an audio output device, such as headphones or speakers.

Let frames[i] be the number of frames that has to be rendered by the callback on the i-th iteration (a few hundreds to a couple of thousands is typical in this situation) Let sr is the sample-rate at which the audio system runs (44100Hz or 48000Hz are typical values) Let r[i] the time, in seconds, it took to render n frames this time. This is in other words, the execution time of the callback

frames[i] / sr is a number of audio frames, divided by the sample-rate, so it's a duration in seconds. It's the duration a buffer of frames[i] samples takes to be played out.

The load for this render quantum is:

load[i] = r[i] / (frames[i] / sr)

In a nominal scenario, the load is below 1.0: it took less time to render the audio than it takes to play it out. In an overload scenario (called under-run in the audio programming jargon), the load can be greater than 1.0. At this point, it is expected that the user will hear audio dropouts. This provokes discontinuities in the audio output and is very noticeable.

Because the time it takes to render the audio is usually directly controllable by authors (for example, by deciding to reduce the quality of some parts of the audio processing graph, that are less essential than others for the application), authors would like to be able to observe this load.

A real-life example that could benefit from this new API would be the excellent https://learningsynths.ableton.com/. If you open the menu by clicking the icon on the top left (on desktop), and scroll down this panel, you see that the render quality is controllable.

Similarly, it's not uncommon for digital audio workstations or other professional audio software to display a load indicator in their user interface, to warn the user that there's too much processing for the system in its current configuration.

In the Web Audio API spec, this is defined in the section Rendering an audio graph.

Discussed Aug 21, 2023 (See Github)

https://github.com/w3ctag/design-reviews/issues/846

Discussed Sep 4, 2023 (See Github)

Sangwhan: multiple subsystems in the web platform... webrtc and webaudio... webaudio is carved out as its own island. Mainly because of the latency requirements... This proposal is trying to address this - that's great - but only for webaudio. WebGPU and WebRTC... Maybe we could discuss at TPAC because we have a necessity to solve this problem.

Dan: Noting multiple stakeholders - Google and Mozilla people involved.

Sangwhan: Converging worker and web audio...

Comment by @cynthia Sep 7, 2023 (See Github)

This, compute pressure, and the worker QoS proposal seems to be all somewhat connected in terms of serving this kind of compute time guarantee needs (or lack of guarantee thereof) - would it make sense to distill some common patterns out of this for consistency?

Comment by @hoch Sep 7, 2023 (See Github)

That's an interesting suggestion. However, the level of precision in Compute Pressure API is not enough (4 buckets) and the design of the Worker QoS proposal seems quite distant from this API. (i.e. you're setting the option at construction time)

Based on the developer survey we conducted, the bucket size 4 is not suitable for anything useful. Another approach that we're discussing at the moment is using the strong permission signal (e.g. microphone) to allow the full precision of capacity value. Conversely, the API only offers limited buckets (~10) without explicit user permission.

Discussed Jan 1, 2024 (See Github)

Sangwhan: similar to compute pressure, but bound to audio worker. Exposes less information. All you know is given the peak compute capacity of the audio worker, where are you sitting when it comes to the amount of pressure, you get that through an event. A level of anonymity where... you could do a side channel from a different tab, but less rewarding than doing it through compute pressure. Use case proposed is to reduce the amount of compute time allocated in the event you are experiencing pressure on the audio worker. If you have high pressure on the audio worker and you're not adequately responding to that, audio glitches. I think it's unfortunate we're trying to introduce two different patterns for this, I've comment. A related proposal on webworkers.. so 3 conflictingish things related to compute capacity and quality of service guarantees. If we let all 3 ship we're going to have inconsistencies in the platform. COmmented but don't think there has been any progress made. I don't think audio wg will revisit this with a new design.

Martin: seems like it provides a lot of granular informationa bout the load on the machine that does escape the sandbox - dependant on what others are doing as much as on what this particular app is doing.

Sangwhan: correct, you can use it as a side channel, from a different origin.. most people have only one audio device so only one audio worker. Compute pressure has a much bigger problem in that sense. Doesn't mean this proposal has zero issues.

Martin: one of the nice things about an audio worker is it runs in a very high priority context, so it has access to very good timing information typically. You are less likely to be preempted by workers operating in other threads or processes, so it provides you a very clean source of timing signal. Not necessarily as much information about what the load is ordinarily.

Yves: the higher priority means you don't really know the real load of all the other things needed for the browser. That's why the compute pressure api can give completely different results from the web audio render capacity one because of that.

Sangwhan: definite characteristics specific to audio workers. Strong preference to have it not migrated across cpus. Cost to that. Quite likely pinned to the cpu

Yves: depends on architecture of the machine

Sangwhan: Valid use case. Don't like the fact we're reinventing things

Yves: it's measuring different things in different contexts. THe thing is, having the same kind of api would be good. Or a general api that sets to the context. Is the context high priority web audo things, or a web rendering thing, or something else?

Martin: alternative api, off the top of my head, if you have the ability to scale you can provide multiple implementions of your audio context and the browser can pick which one it execustes based on load. If one starts lagging it executes the one that's supposed to be faster

Yves: can this be abused to get some information?

Martin: absolutely

Yves: using the webaudio api no signal just to check what is the current load of the signal

Sangwhan: interesting appraoch.. would have the same problems.. when you switch from one version to another and it happens as a side effect from a different origin you're going to expose the same information..

Martin: not necessarily at the ame granularity

Sangwhan: wouldn't alleviate the problem of timer granularity guarantees, unless you start adding nosie which I think is a bad idea for an audio worklet. Taking a scalar and quantizing it to different buckets, so in that sense your'e reducing the amount of entropy exposed. I have bigger concerns about the inconsistencies across apis for mechanisms for providing such functionality, than privacy concerns.

Martin: explain further

Sangwhan: capacity event and compute pressure should be more consistent so there's interop, and the QoS api which takes a different approach to solve a different problem. It's a wob worker proposal so doesn't fit into this context - audio workers are less webby, this is more attached to the DOM. Maybe we can let that one slide

Amy: they responded to this with some reasons why.. Are we at a place we can close with concerns, or is there more work to be done?

Sangwhan: 20 buckets would be fine, maybe..

Martin: substantial number who say 'might'.. out of 31 people 12 said maybe.. not resounding support

Sangwhan: 100 buckets is as good as having no buckets

Yves: inconsistent responses.. 20 buckets nobody says inconsistent for my needs, but not for 100 buckets

Martin: this is one angle on the problem. Not sure this analysis is the right one to be applying in this context.

Sangwhan: maybe a way forward is to gate more granular information behind a permission and by default do it through buckets. Then we could potentially align the apis.

Amy: should we push on this? Or just close with concerns/unsatisfied?

Sangwhan: I'd want us to push - it's still malleable and they might be receiptive

drafts comment

Martin: I'd like to see some sort of analysis of what the leakage risk is. Arbritary criteria for number of buckets doesn't capture the privacy situation very well at all. What Yves was talking about before - high pri process with limited processing capacity, so less subject to compute pressure by things at a lower priority

Yves: if you see the number of buckets of .... of the signal there are studies that show that 2 buckets is enough to reconstruct the signal, you just need more time and more samples. Not really a big issue

Sangwhan: is it already not possible as of today?

Martin: that's the wrong line of argumentation to take. If it's already possible that's a vulnerability in the platform, not an excuse to not bother

Sangwhan: there has been work done by compute pressure, they've experimented with different buckets, tried to do a cross origin side channel communication. They have a proof of concept. How much time proportionate to the granularity of the bucket

Martin: that would be useful

Sangwhan: I can share that. How that works in an rt priority single threaded setup like this is unclear. I'd imagine similarly.

Martin: keep in mind that the individuals producing such analysis are motivated for it to produce a certain result. I'd be more confident if it were independant. If someone contracted an academic to try to break their stuff.

Discussed Jan 1, 2024 (See Github)

Dan: we need to re-assign ...

reassigned to Tess & Matthew

Comment by @cynthia Jan 23, 2024 (See Github)

Sorry for the long delay. We've discussed this during our F2F, and having some level of consistency/interoperability between this proposal and compute pressure would be a better architectural direction. (Setting aside QoS and how to make that proposal consistent, as it seems much earlier stages)

Some questions for you:

Can you consider to have a common interface for pressure signal shared between Compute Pressure and RenderCapacity? If not, why?
Can you consider using an Observer for your use case? If not, why? (We will ask the opposite about Compute Pressure and events)
Where does the working group stand with respect to limiting the granularity of the pressure? Do you have agreement about limiting granularity in the absence of some gating function, like gaining permission for microphone or similar?

With all of these questions, we think the use cases are valid so no questions there.

Comment by @kupix Feb 21, 2024 (See Github)

There is a way to estimate render capacity that works today: capture timestamps before and after processing on the audio thread. This method isn't without challenges: timestamps can only be captured with 1ms precision (thanks Spectre) and the audio chunk rate may beat with the 1kHz timestamp clock. So aggregation/estimation requires a period of the order of 1 second to stabilise although this could probably be improved with more sophisticated timestamp processing.

See it in action: https://bungee.parabolaresearch.com/bungee-web-demo.

There may be a further challenge with adapting processing complexity according to render capacity. Occasionally something (browser or OS) seems to detect a lightly used thread and either move it to an efficient or low-clocked core. So, paradoxically, faster render code can sometimes result in increased render capacity. This is a "denominator problem" that needs more study.

Simple sample below (simpler averaging than link above).

class NoiseGenerator extends AudioWorkletProcessor {
  constructor() {
    super();
    this.active = this.idle = 0;
  }

  process(inputs, outputs, parameters) {
    const start = Date.now();
    if (this.idle) {
        this.idle += start;
        console.log("Render capacity: " + 100 * this.active / (this.active + this.idle + 1e-10) + "%");
    }
    this.active -= start;

    // generate some noise
    for (let channel = 0; channel < outputs[0].length; ++channel)
      for (let i = 0; i < outputs[0][channel].length; ++i)
        outputs[0][channel][i] = Math.random() * 2 - 1;

    const finish = Date.now();
    this.active += finish;
    this.idle -= finish;

    return true;
  }
}

registerProcessor('noise-generator', NoiseGenerator);

Discussed Mar 18, 2024 (See Github)

Matthew: i've done some work here.. but there's a new review request that came in 2 weeks ago - play-out statistics - that looks similar ...

Yves: would say

Discussed Mar 25, 2024 (See Github)

Matthew: I think they've answered most of the questions we asked. They seem to've moved closer to the Compute Pressure API appraoch (with buckets) since our review started. Would it be useful to share the info from DAS WG about side-channel attacks with the Web Audio group?

Martin: Seems reasonable. Doesn't say how many buckets.

Matthew: They're still deciding that.

Martin: Risk is that due to things like CPU throttling, attacker can learn about the nature of your machine. Though these things tend to be high latency. This is a low bandwidth channel between origins.

Martin: Important to remember that attacks only get better over time.

Martin: Mitigations seem OK. I think they only need a handful of buckets.

Tess: They did a developer survey that came up with the conclusion that 4 buckets isn't enough. They seem to want 10.

Martin: I think 10 may be OK based on those stats - though web sites will always want to know as much as possible.

Martin: Dropping some buckets (reduced resolution) below 50% capacity seems reasonable, as you aren't as interested in that zone.

Tess: Do you think 10 would be fine without a permission prompt?

Lea: I would worry this may contribute to permission prompt fatigue.

Martin: Could be OK without permission. If allowing more buckets with an open mic (for example), the number shouldn't be a huge amount more.

Matthew: that case (open mic permissions) is mentioned - but for quite a high extra number of buckets.

Lea: They do state they chose linear bucketed approach.

Martin: They wanted events for when it changes, which I think suggests pure buckets.

Lea: That's a little strange, though, as it depends on the number of buckets. I thought the resoultion of the buckets could change in future if they're not serving needs. Does events make this a problem?

Martin: No. If we ship with 5 buckets, and then go with 10, they'll get events with whatever the value is more often.

Lea: I'm not following.

Martin: If the value that's exposed is rounded, then if the code is "if load is below x, change to a lower/higher resolution thing" then it's fine.

Lea: Developers may make decisions about code they can run on an event depending on how frequently it fires.

Martin: Sure but that'd only change dramatically if you added a lot. I think this is probably acceptable in the sense that if you expected a low rate of events, but got a higher one, some adaptation may be needed, but I don't expect it will change that much (in which case some signal processing would be needed).

Lea: Does "update" make sense as an event name? I guess it's on render capacity.

Martin: It's OK.

Lea: I think "change" might be more compatible with other event names.

Matthew: How similar do we think this should be to Compute Pressure API? Same interface, different buckets OK?

Martin: Compute Pressure is about general compute resources, this one is more about the ability of an Audio thread to maintain performance.

Matthew: This seems consistent with the past discussion on this topic - sounds like we're OK with what they have.

Martin: Yes, this seems fine - though I would push for a lower number of buckets than what they have, as it can always be changed.

Lea: In that case we could suggest non-linear buckets.

Matthew: That's different to Compute Pressure (but OK if justified, as it seems, and documented).

Martin: Compute Pressure is labelled (nominal, fair, etc...), so different already.

Peter: Don't see a problem with non-linear, which might reduce entropy for some people. That might be better for privacy.

Lea: Beyond bucket sizes/boundaries, what's the DX? Starting/stopping listening. If you have code that draws the graph... where does this fit in? Continuous monitoring, or only some of the time?

Martin: I imagine with an event-based model, you listen and receive the event. Switch level based on threshold.

Lea: When do you decide to listen?

Martin: Always.

Lea: If you listen to it always, why do you have to opt in?

Martin: So the browser knows where the event to be delivered.

Lea: In that case, could the event be on the AudioContext? It seems that starting and stopping is adding friction. If it was performance intensive to listen to it, that's one thing, but if you listen consistently, this is boilerplate.

Peter: [ scribe missed this; sorry! ]

Lea: I thought this is basically throttling.

Martin: I understand this is a window over which it's averaged ("over a period of half a second, tell me what the load was") - in which case having start and stop makes sense.

Lea: There could be a property that sets the granularity.

Lea: Meta point: this is why we have labels in the new process for these things - we need to ensure that API design issues (for example) are covered, as well as security and privacy (etc.)

Matthew: How to handle >100% with buckets?

Martin: There was a counter proposed (which presents a possible side channel) but if you just report "over 100%" then you're in a pretty good position.

Matthew: There was discussion that the capacity can be computed now (with timers) - should we be concerned about that?

Peter: It's not clear why they went with the event-based approach over polling, from the explainer doc.

Peter: event name "change" works for events fired when it changes but if it's on an interval, then "update" is more apt.

Martin: +1 (as above)

Lea: We should have a principle on this (interval vs changes)

Peter: +1

Peter: Also on this one I think it shoudl be a change event (only fired when it changes). When you start observing, you could say how much of a change you want to trigger a firing of the event.

<blockquote> Hello there! We looked at this today during a breakout.

Other TAG members will comment on other parts of the proposal, but we had some questions wrt API design. We need to better understand how this API fits in to the general use cases where it will be used. Currently, the explainer includes a snippet of code showing this in isolation, where it is modifying parameters in the abstract. What is the scope of starting and stopping this kind of monitoring for the use cases listed? Are authors expected to listen continiously or sample short periods of time (because monitoring is expensive)?

If they are expected to listen continuously, then are start() and stop() methods required? If the only purpose of these is to set the update interval, that could be a property directly on AudioContext (in which case the event would be on AudioContext as well and would be named in a more specific way, e.g. rendercapacitychange).

We were also unsure what the update interval does exactly. Does it essentially throttle the event so you can never get more than one event per that period? Does it set the period over which the load is aggregated? Both? Can you get multiple update events without the load actually changing?

Lastly, as a very minor point, change is a far more widespread naming convention for events compared to update, see https://cdpn.io/pen/debug/dyvGYoV update makes more sense if the event fires every updateInterval regardless of whether there was a change, but it produces better DX to only fire the event when the value has actually changed so that every invocation is meaningful.

</blockquote> <blockquote> We think that the general approach to managing side-channel risk is acceptable.

Overall, fewer buckets would be preferable; at most 10, though preferably 5. Though surveys indicate that some number of sites would be unhappy with fewer than 10 buckets, there is an opportunity to revise the number of buckets over time based on feedback on use. Increasing the number of buckets should be feasible without affecting site compatibility. Starting with a more private default is the conservative option. Increasing resolution carries a small risk in that change events are more likely to occur more often (see API design feedback).

More detail is ultimately necessary to understand the design:

Is hysteresis involved?
Is the reported value a maximum/average/percentile?
What happens when the load exceeds 100%?

</blockquote>

Comment by @LeaVerou Mar 25, 2024 (See Github)

Hello there! We looked at this today during a breakout.

Other TAG members will comment with other components of the review, but we had some questions wrt API design. We need to better understand how this API fits in to the general use cases where it will be used. Currently, the explainer includes a snippet of code showing this in isolation, where it is modifying parameters in the abstract. What is the scope of starting and stopping this kind of monitoring for the use cases listed? Are authors expected to listen continuously or sample short periods of time (because monitoring is expensive)?

If they are expected to listen continuously, then what is the purpose of the start() and stop() methods? If their only purpose is to set the update interval, that could be a property directly on AudioContext (in which case the event would be on AudioContext as well and would be named in a more specific way, e.g. rendercapacitychange).

We were also wondering how this relates to https://github.com/w3ctag/design-reviews/issues/939 ?

Comment by @martinthomson Mar 25, 2024 (See Github)

An addendum on security... We think that the general approach to managing side-channel risk is acceptable.

More detail is ultimately necessary to understand the design:

Is hysteresis involved?
Is the reported value a maximum/average/percentile?
What happens when the load exceeds 100%?

Discussed Apr 22, 2024 (See Github)

Still pending feedback.

Discussed Jul 1, 2024 (See Github)

Pinged proponents.

Comment by @martinthomson Jul 1, 2024 (See Github)

@hoch, @padenot, do you have any feedback on the questions above?

Comment by @padenot Jul 2, 2024 (See Github)

This is somewhat in pause for now at the Audio WG level, implementors aren't exactly sure how to ship this.

Discussed May 19, 2025 (See Github)

Jeffrey: We should just ask the proponents what's going on with this. If this is being discussed in the Privacy WG, we don't need to add our own privacy opinions.