Mozilla comments: Not yet, (unofficially favorable)
WebKit comments: Not yet
Major unresolved issues with or opposition to this design: N/A
You should also know that...
Summary: Extends the SpeechRecognition interface by adding a 'quality' property to SpeechRecognitionOptions. This allows developers to specify the semantic capability required for on-device recognition (via processLocally: true). The proposed quality enum supports three levels: 'command', 'dictation', and 'conversation'.
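The proposed option can be sketched as follows. This is a hypothetical usage sketch based only on the summary above: the `quality` property, its three enum values, and `processLocally` come from the proposal, but the exact API shape (constructor options vs. attributes) may differ from the final spec, and the `pickQuality` helper and its task names are purely illustrative.

```javascript
// Hypothetical sketch of the proposed 'quality' option; the exact API
// surface may differ from the final Web Speech API spec text.

// The three enum values come from the proposal.
const QUALITY_LEVELS = ["command", "dictation", "conversation"];

// Illustrative helper: pick the lowest quality level that covers a task.
// This mapping is an assumption, not part of the spec.
function pickQuality(task) {
  switch (task) {
    case "voice-command":
      return "command";       // short, constrained utterances
    case "note-taking":
      return "dictation";     // longer free-form speech
    default:
      return "conversation";  // multi-turn, highest capability
  }
}

// Browser-only usage, guarded so the sketch is inert outside a browser:
if (typeof SpeechRecognition !== "undefined") {
  const recognition = new SpeechRecognition();
  recognition.lang = "en-US";
  recognition.processLocally = true;                // existing on-device flag
  recognition.quality = pickQuality("note-taking"); // proposed property
  recognition.onresult = (e) => console.log(e.results[0][0].transcript);
  recognition.start();
}
```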
Christian: Marcos mentioned this has the same issue as the Prompt API: you can query whether there is a model present on the device suitable for a certain level/language, and this could be a fingerprinting vector. I want to recommend they look at the Prompt API review.
Ehsan: Isn't this the same problem with all language-based models? Maybe we can have a consistent answer.
Lola: Do you suggest we should have a document on language-based models?
Ehsan: That would be my suggestion. Come up with a document that describes all of that. Should be done at the WebML groups. Think this is coming up more often.
Marcos: It’s a more general problem of downloading system components which then you can query, because then they become global.
Lola: To make this even more general, should we have a position on downloading system components? Or is this restricted to this use case?
Marcos: No, could be related to everything. Codecs, etc. Should be a design principle.
Lola: Who would be willing to write that? We also have another plenary before the F2F.
Christian: Could offer to do that, would be my first design principle, and a topic where I’m interested in.
Ehsan: Same here. Would be good to have a more experienced TAG member on that as well.
Lola: Design principles is owned by Jeffrey, so we can talk to him about that.
Christian: We asked a question here. There is the fingerprintability concern. We also talked about this when we reviewed the Prompt API: you can basically download AI models to the system as a global component, and in choosing languages and other options, it becomes fingerprintable.
That was our first reaction. The Web Speech API with local processing is already there; it exists. It can already download AI models, and the previous TAG closed that with concerns. So now they are just adding the quality level, which is a minor change but adds to the fingerprintability.
We are now waiting for a response.
Hadley: maybe if we don't hear back next week, let's nudge?
Christian: sure
Comment by @christianliebel Feb 19, 2026 (See Github)
Hi @evanbliu, thank you for your proposal.
We have one question regarding privacy: Could an attacker fingerprint the user’s browsing history by installing certain or rare languages along with model qualities on site A, and checking for the availability of those permutations on site B? And if so, how is that fingerprinting concern mitigated?
It would be great if you could add a security & privacy questionnaire and answer that question.
Thanks for raising this! It's worth noting that this proposal doesn't actually introduce on-device speech recognition, as it's already part of the existing spec. The same fingerprinting concerns you brought up are already present for current on-device speech recognition and are mitigated by the countermeasures detailed in this PR: https://github.com/WebAudio/web-speech-api/pull/165. These mitigations are based on those developed for the Writing Assistance APIs (https://webmachinelearning.github.io/writing-assistance-apis/).
Let me know if you have any concerns or questions!
Christian: this is related to the Global Browser Component topic we discussed in our f2f. The previous TAG said satisfied, or satisfied with concerns, but not for the basic local Web Speech API. Now they want to extend that local API by giving you a quality property, the level of speech recognition quality. Marcos looked at the pull request. Hard to say anything other than satisfied with concerns, because we already said that.
Marcos: perfect summary.
Christian: shall we close it with satisfied with concerns, with the concern being fingerprinting?
Christian: We feel this will be satisfied with concerns. We asked them to add fingerprintability information, but this was ignored. They said it's been covered already, but adding the quality dimension adds a lot to the fingerprinting vector.
Marcos: We can see the problem again where we're closing issues but then have no mechanism to follow up - e.g. Christian said it would be nice to see certain things. Maybe we should file bugs relating to those concerns? Then they can't avoid the feedback we give them.
Lola: We've spoken about this in relation to other issues.
Matt: We have process for what you just described Marcos. We can see in one place when groups have closed issues, if they've addressed resolutions, etc. We have it and we could use it.
Christian: Should we try out that process here?
Matt: I'm working on guidance for how to do this. I'm happy to take that and make it applicable for TAG, but docs are needed. Happy to take it on and work with chairs.
Lola: Let's write the closing comment for this, and then figure out with chairs, Matthew, and anyone else, how to track things.
Christian: Last time we weren't sure what to do about the fingerprinting risks. It tends to make fingerprinting worse. The proponents ignored our request to add a S&P section. The alternative was that we create an issue in their repo to track it there.
Marcos: S&P is a requirement for any W3C spec.
Matthew: Could show you how to track issues, if somebody is interested. Reach out to me if you want to learn how tracking works.
Ehsan: Curious to know if they have communicated the reason for not adding a S&P section? The choice of on-device processing reflects an interest in privacy.
Christian: Their argument is that having a local model means their S&P considerations are already there.
Marcos: They mention mitigations; they've not landed.
There's a larger concern around Web Speech that contributions are Chrome-only, and lack of implementation commitments. Though I see some from Mozilla.
Yves: It would fail the TR '2+ implementations' test if that was a problem at transition time.
Marcos: This is confusing as it seems to be incubation work, but being done within a WG. Checks charter. Will file a bug.
... They're saying they think they solved it through un-merged PRs on the spec. We could push back and say it'd be helpful to give people a summary in the explainer, as it's confusing.
Christian: We could conclude the review here, and ask them to add a link in the explainer at that time.
Matthew: Question is, we would be happy if the PRs are merged. "Assuming the PRs are merged, and the link is added…"
Does anyone have any other comments on this issue? I've included responses to the security & privacy questionnaire below:
2.1. What information does this feature expose, and for what purposes?
Exposure: The feature exposes the availability of specific on-device speech recognition capabilities (categorized as 'command', 'dictation', or 'conversation') for a given language.
Purpose: This exposure allows web developers to specify the semantic capability required for local, on-device speech recognition (when processLocally: true is utilized). This helps optimize the underlying engine's performance, accuracy, and power consumption based on the specific task the user is performing.
2.2. Do features in your specification expose the minimum amount of information necessary to implement the intended functionality?
Yes. The proposal restricts the exposure to a simple, predefined enum of three distinct quality levels. It does not expose granular details about the user's specific hardware, the exact machine learning models installed, or the underlying operating system's native speech APIs.
2.6. Do the features in your specification expose information about the underlying platform to origins?
This API does not introduce on-device speech recognition itself (which is already part of the existing spec). The fingerprinting concerns associated with model availability are addressed via mitigations detailed in WebAudio/web-speech-api#165. These countermeasures are modeled after the Writing Assistance APIs, which typically mitigate this by downloading models on demand (rather than revealing pre-installed state) or by standardizing the availability of core models to reduce entropy.
2.7. Does this specification allow an origin to send data to the underlying platform?
Yes. The proposal allows an origin to pass a specific quality constraint through the browser to the underlying platform's local speech recognition engine to configure how the audio stream is processed.
2.8. Do features in this specification enable access to device sensors?
Yes (Inherited). While this specific proposal only adds an options property, the underlying Web Speech API intrinsically requires access to the device's microphone. This proposal relies entirely on the existing permissions model, user prompts, and security indicators currently established for microphone access in the browser. It does not introduce new sensor access mechanisms.
2.13. How does this specification distinguish between behavior in first-party and third-party contexts?
The proposal does not explicitly introduce new behaviors for third-party contexts. However, like the broader Web Speech API, microphone access (and therefore the ability to use this feature) should be governed by Permissions Policy. Third-party iframes would require explicit delegation (e.g., allow="microphone") from the first-party context to utilize speech recognition at any quality level.
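As an illustration of that delegation, a first-party page would need markup along these lines (the embedded origin is hypothetical; `allow="microphone"` is standard Permissions Policy syntax):

```html
<!-- First-party page delegating microphone access to a third-party frame;
     without the allow attribute, speech recognition in the frame is blocked. -->
<iframe src="https://third-party.example/voice-widget"
        allow="microphone"></iframe>
```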
2.14. How do the features in this specification work in the context of a browser’s Private Browsing or Incognito mode?
The API should function similarly to standard browsing, provided the user grants microphone permissions. However, to prevent cross-session tracking, browsers may need to apply stricter model-download heuristics in Private Browsing. For example, if a specific 'dictation' model for a rare language is downloaded during an Incognito session, the browser must ensure that the availability of this newly cached model is not exposed to subsequent standard browsing sessions, and vice versa, to prevent linking the two profiles.
Opened Feb 2, 2026
Explainer
https://github.com/WebAudio/web-speech-api/blob/main/explainers/quality-levels.md
Where and by whom is the work being done?
Feedback so far
Specification URL: https://webaudio.github.io/web-speech-api
<!-- Content below this is maintained by @w3c-tag-bot -->Track conversations at https://tag-github-bot.w3.org/gh/w3ctag/design-reviews/1189