#1038: On-device Web Speech API


Opened Jan 9, 2025

Hello, TAG!

I'm requesting a TAG review of on-device support for the Web Speech API.

This feature adds on-device speech recognition support to the Web Speech API, allowing websites to ensure that neither audio nor transcribed speech is sent to a third-party service for processing. Websites can query the availability of on-device speech recognition for specific languages, prompt users to install the necessary resources for on-device speech recognition, and choose between on-device or cloud-based speech recognition as needed.

2.2. Do features in your specification expose the minimum amount of information necessary to enable their intended uses? Yes. Some websites may have strict privacy requirements that mandate on-device speech recognition, so websites must be able to determine whether it is possible to ensure that neither audio nor captions are sent to a third-party service for processing.

2.6. Do the features in your specification expose information about the underlying platform to origins? While this feature does not directly expose information about the underlying platform, websites may potentially use performance metrics for on-device speech recognition to gauge general hardware capability.

2.15. Does this specification have both "Security Considerations" and "Privacy Considerations" sections? Yes, the spec contains a section on how to reduce the risk of fingerprinting. Websites need explicit user permission to install on-device speech recognition language packs that do not match the user's preferred language, or if the user is not on Ethernet or Wi-Fi.

Further details:

  • I have reviewed the TAG's Web Platform Design Principles
  • The group where the work on this specification is currently being done: Audio Community Group
  • The group where standardization of this work is intended to be done (if different from the current group): Audio Working Group
  • This work is being funded by: Google

You should also know that... The primary risk of this new functionality is the potential for fingerprinting. To mitigate this risk, the Chrome Trust & Safety team proposes requiring explicit user consent to install language packs that do not match one of the user's preferred languages or if the user is not on an Ethernet/Wi-Fi network.

The existing Web Speech API has an outdated callback design that must be maintained due to backwards compatibility/interoperability issues. While Firefox doesn't officially support the speech recognition section of the Web Speech API, it has an unprefixed implementation behind a flag, and most of the guides on how to use the Web Speech API do something like window.SpeechRecognition || window.webkitSpeechRecognition; (examples from developer.mozilla.org, codeburst.io, dev.to), and there are 17.8K instances of this kind of usage on Github alone. The Audio Working Group is looking into potentially replacing this API with a new, modernized version under a different name. A separate TAG design review will be sent for that if the group decides to proceed with the new API.
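For reference, that idiom looks roughly like this (a minimal sketch; everything beyond the feature-detection line is illustrative boilerplate):

```js
// Minimal sketch of the feature-detection idiom referenced above: fall back to the
// prefixed Chromium name when the unprefixed constructor is absent.
const SpeechRecognition =
  window.SpeechRecognition || window.webkitSpeechRecognition;

if (SpeechRecognition) {
  const recognition = new SpeechRecognition();
  recognition.lang = 'en-US';
  recognition.onresult = (event) => {
    console.log('Transcript:', event.results[0][0].transcript);
  };
  recognition.start();
}
```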

Discussions


Discussed Jan 20, 2025 (See Github)

None of us have done the homework on this one; we'll read up on it and take another pass next week.

Comment by @RByers Jan 22, 2025 (See Github)

Note that the Blink API owners are actively considering (and will likely soon approve) an I2S on this.

An explainer is under review here.

Discussed Jan 27, 2025 (See Github)

[Discussion of Jeffrey's suggested comment.]

Google Meet might want to do on-device recognition iff all languages spoken are available. The proposed API doesn't support good UI for this: we'd want a single request that lists all the needed languages.

General support for having a boolean where the site requires recognition to be done locally, and then browser UI asking to download a model for any use of speech recognition.

Peter: also mention that it would be nice if users could pick their cloud recognizer.

Jeffrey's suggestion for Plenary:

<blockquote>

Being able to recognize speech on a user's local device seems like a great advance, and we strongly support it.

We note that this is a prefixed API with an old design that it might have been good to update in the process of unprefixing it. Unfortunately, documentation violated https://www.w3.org/2001/tag/doc/polyfills/#don-t-squat-on-proposed-names-in-speculative-polyfills and recommended assuming the prefixed and unprefixed names would behave identically, so that option is closed.

We see sites' desire to guarantee that speech isn't sent outside of the local device (for example to a cloud speech recognizer operated by the browser), especially for end-to-end-encrypted (E2EE) services. This is hard to actually guarantee: a valid browser architecture, especially for low-power devices, is to run the entire browser in the cloud and stream its UI down to the user. But with that caveat, we support giving the site a way to encourage the browser to keep the speech local.

We don't see a reason to let sites require the recognition to happen in the cloud if the user prefers it to happen locally. That is, recognition.mode == "cloud-only" doesn't seem like an option that should exist. Perhaps just recognition.localOnly = true or onlyOnDevice, or have sites check the return value of the requestLocalRecognition([lang1, lang2, ...]) function suggested below?

We wondered why recognition.onDeviceWebSpeechAvailable() and recognition.installOnDeviceSpeechRecognition() take a lang parameter when recognition.lang exists. @jyasskin asked you via a side-channel and heard that a video-conference system might have participants on a call speaking several languages and would want to recognize locally only if all the participants' languages are supported. We think there's a better design for this along two axes:

  1. Having each listener spend CPU cycles to recognize audio seems wasteful compared to having either the sender or the cloud recognize the speech once. In an E2EE system, the cloud might not be able to, but putting the responsibility on the sender seems more efficient, both in terms of the number of language pack downloads and the CPU time for recognition.
  2. If a site wants to know if it can recognize in multiple languages, it should request downloads of all those languages in a single permission prompt. So an API more like SpeechRecognition.requestLocalRecognition([lang1, lang2, ...]). This request function also seems to mitigate the fingerprinting risk better than a query function that can ask which languages are present without showing the user a prompt.

Browsers should also proactively offer language pack downloads to people using sites that don't know about this option. Could the spec suggest this? Would an omnibox icon be an appropriate UI in your Chromium implementation?

Similarly, there might be cases where a user actively wants the recognition to happen in the cloud, for example if their battery is unusually low, or (speculating) to use a particular cloud service if there are several options with privacy tradeoffs. The same UI might be appropriate for giving them that control.

Nits:

  • onDeviceWebSpeechAvailable looks like it's going to be an event handler because of the initial on. Above, we suggested a design that doesn't need this name.
  • "download over a cellular network" isn't quite the right condition for guessing that the user might prefer not to pay for the download. https://github.com/tomayac/netinfo/blob/relaunch/README.md#metered-connection suggests "metered", but that's not adopted into a working group yet.
</blockquote>
Discussed Feb 3, 2025 (See Github)

Jeffrey: wrote up a very long proposed comment https://github.com/w3ctag/design-reviews-private-brainstorming/issues/98#issuecomment-2634875066

Martin: This might reasonably live on the user agent, which is a prerequisite. Domenic proposed a typing-completion API, which seemed to require a large model. So maybe let the website bring the necessary compute. Having a browser API means browsers are forced to be in the middle.

Jeffrey: And there's some work in Chromium about letting websites share model downloads, which is difficult for privacy reasons.

Martin: Speech recognition might be small and useful enough that we'd assume they're present.

Jeffrey: But that raises questions for the multi-language support.

Martin: If you're recognizing the language the user of the device is producing, that's likely ok. But recognizing a remote person's language, maybe you can't assume that's available.

Martin: Question of local vs cloud is important and needs to be resolved. I think that before we add APIs for this, there should be a reasonable expectation that it'll be local. A lot of the discussion so far has been about LLMs, which are big.

Xiaocheng: Huawei is interested in enabling on-device models. It provides the device, the OS, and the browser. What approach is appropriate? Query a local model registry?

Jeffrey: Lots of entities are interested in that (e.g. ChromeOS, Apple, Intel). It's difficult to both support hardware advances and work across all platforms, and especially to support browsers that aren't tied to specific hardware or OSes.

Martin: When it comes to larger models, or experimental models, doesn't make sense to have a web API just yet. Let sites experiment, and identify commonalities later. In the case of speech recognition, which is a core OS feature, making that core OS feature available seems reasonable. Have to do it without local personalization.

Jeffrey: Talk to Domenic Denicola.

Jeffrey: I'm hearing concern about cross-language packs, but not concern about the capability at all. I'll edit that into my comment before plenary.

Martin: Distinguish speech recognition on device for form entry, from this capability where the site recognizes.

[Discussion of the permission shape. This spec doesn't cite getUserMedia.]

Jeffrey: The spec isn't in good shape, but I don't know that it's worth saying that given all the other commentary.

Martin: Call out the fingerprinting risk of onDeviceWebSpeechAvailable().

Martin: Bilingual people may speak multiple languages in one chunk of text. The models can often deal with that, but this API can't.

Jeffrey: That's probably a larger change than they're hoping to do in this update.

Martin: But maybe they should do it right. I think this is dated enough that it needs work.

Martin: Themes: where the model lives; API shape in general; unnecessary fingerprinting risk. Will talk to Paul Adenot.

Jeffrey: My sense is we should make sure the new parts are good, but leave the old thing alone.

Martin: Vs build a new thing to replace the old thing. You probably want a stream. Then you'd drop the integrated microphone permission, you'd add language to the stream (where unrecognizable text would say [unrecognizable text]). Fingerprinting exposure is related to the text you put in, where the audio-stream input makes fingerprinting worse.

Jeffrey: Would prefer not to tell them definitively to do this over, but we might request that they think about alternatives more thoroughly.

Martin: Yes.

Jeffrey: Will update the comment for plenary.

(side discussion: the question of what we might do about how models are run is one where the TAG should perhaps think about putting out a finding)

Discussed Feb 10, 2025 (See Github)

Jeffrey: How's my new shorter comment?

Martin: Looks good. We should use some of the other comments when thinking about the space.

[commented]

Comment by @jyasskin Feb 11, 2025 (See Github)

We discussed this in a breakout today:

We think it's a useful advance to help websites recognize a user's speech on their local device, in line with our data minimization guidance. We see in the I2S thread that the CG is actively iterating on the shape of this API, and we'd like to let them come to some conclusions before we review the result. We've opened a design principles issue on the architectural issues around downloading large data files to support browser APIs, which apply across this and the Translation and Writing Assistance APIs.

We'll close this review for now, but please comment when the WG is closer to consensus or has architectural questions about the API.

Comment by @evanbliu Apr 2, 2025 (See Github)

The Audio WG has reached consensus on the shape of the on-device speech recognition part of the Web Speech API spec. Please let me know if the TAG would like to re-review this.

Comment by @jyasskin Apr 2, 2025 (See Github)

Could you update the explainer to describe the new consensus shape for the API?

Comment by @evanbliu Apr 2, 2025 (See Github)

Oops, I just sent out this PR updating the explainer: https://github.com/WebAudio/web-speech-api/pull/149

Discussed Apr 14, 2025 (See Github)

Proposed comment at https://github.com/w3ctag/design-reviews-private-brainstorming/issues/98#issuecomment-2803367736

Christian: How should this align with the Speech Synthesis API?

Jeffrey: Could just ask them to express an opinion.

Dan: In the use case they have in mind, if the UA doesn't have local recognition, does the use case want to proceed or stop?

Jeffrey: I think they have both in mind.

Christian: They don't really explain this use case. Explainer

Dan: Think a low-end device should be allowed to lie.

Christian: Want to be in control. Same as speech synthesis: if you're offline right now, you have to say "want to do this locally." Want the chance to express your intent.

Xiaocheng: Does the cloud-only mode restrict its capabilities in a PWA?

Jeffrey: It would, which is a reason not to have it.

Consensus to post the below comment:


Comment by @jyasskin Apr 15, 2025 (See Github)

We discussed this in a breakout today, and we're enthusiastic about helping websites recognize a user's speech on their local device, in line with our data minimization guidance. Thank you for working on that.

With that said, we'd like to request several clarifications and API improvements:

Restricting recognition location

We see sites' desire to guarantee that speech isn't sent outside of the local device (for example to a cloud speech recognizer operated by the browser), especially for end-to-end-encrypted (E2EE) services. This is hard to actually guarantee: a valid browser architecture, especially for low-power devices, is to run the entire browser in the cloud and stream its UI down to the user. Similarly, a UA running on a low-power device might prefer to run recognition in the cloud, perhaps to improve recognition quality. With those caveats, some of the TAG supports giving the site a way to encourage the browser to keep the speech local. Or, perhaps equivalently, for the site to tell the UA that it's going to keep the audio constrained to as few devices as possible, as a hint that the UA should do the same. This should not be a strong UA requirement, which means that it should not be called ondevice-only.

We don't see a reason to let sites require the recognition to happen in the cloud if the user prefers it to happen locally. Forcing a user to send unnecessary data to their UA's servers would violate our data minimization principles. So we recommend that recognition.mode == "cloud-only" not exist. Perhaps just recognition.localOnly=true or recognition.onlyOnDevice=true, or have sites check the return value of the requestLocalRecognition([lang1, lang2, ...]) function suggested below?

We also note that there's an existing SpeechSynthesis API that mentions localService for voices. Could you add an analysis of how you're intentionally matching or diverging from that design?

Recognizing other users' speech

It's not clear whether it makes sense to have the browser download language packs to help websites recognize speech that comes from sources other than the current user speaking. This hesitance comes from two angles:

  1. It seems more efficient to recognize speech once, and share the recognized text to all recipients, rather than having each recipient redundantly recognize it for themself. (https://w3c.github.io/sustainableweb-wsg/#success-criterion-client-vs-server-human-testable)
  2. The download costs users something, and managing that makes the API significantly more complicated.

We could use a better description of the use cases that need these costs and complexity before we'll be comfortable endorsing that capability.

If this capability is justified, a site should request downloads of all of the needed languages in a single permission prompt. Without that, the user might misunderstand how much data they need to download before it would benefit them, which compromises their ability to consent. So, if this capability is needed, we suggest an API more like SpeechRecognition.requestLocalRecognition([lang1, lang2, ...]).
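A purely hypothetical sketch of how a site might use such a batched request (requestLocalRecognition, its parameter, and its boolean return value are all illustrative, not part of the current specification):

```js
// Hypothetical sketch only: a single permission prompt covering every language the
// site needs, with a boolean result indicating whether local recognition was granted.
async function setUpLocalCaptions(languages) {
  const granted = await SpeechRecognition.requestLocalRecognition(languages);
  if (!granted) {
    // Fall back to the site's own captioning path, or disable the feature.
    return null;
  }
  const recognition = new SpeechRecognition();
  recognition.lang = languages[0];
  return recognition;
}
```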

Personalization

In cases that are only recognizing the primary user's speech, we'd like the specification to analyze when it's safe and useful to personalize the speech recognition to that user. This could be as simple as "UAs MUST NOT personalize speech recognition", but we think there might be some utility in letting the start() method that doesn't take a MediaStreamTrack be personalized in order to better recognize the user's speech. For fingerprinting and other privacy reasons, we think personalization is only feasible if the site can't get both the audio and the personalized recognized text, and there might be other risks or problems we haven't thought of.

Fingerprinting

We note that availableOnDevice() has some inherent fingerprinting risks, similar to those created by permissions.query(). The explainer currently just says this lets sites "determine whether to enable on-device features or fall back to cloud-based recognition", which would be enabled just as well by a better return value for the install function, probably getting inspiration from the Translator.create() interface (cc/ @domenic). What specific site UI needs this sort of no-UI query function?

We noticed an additional fingerprinting risk, that the exact version of the downloaded language pack is likely detectable and likely to skew compared to the browser's major version. The specification should identify that risk and suggest or mandate ways to mitigate it. For example, to remove the risk entirely, you could require that each browser major version can only use 1 version of each language pack and that the packs are deleted when the user upgrades to an incompatible browser version. To only mitigate it, you could remove the start(MediaStreamTrack) overload or require packs be deleted when storage is cleared.

UI and user choices

The explainer says "user agents must obtain explicit and informed user consent before installing", but that doesn't appear in the specification, and the algorithm for installOnDevice() just says the UA "can prompt the user". This seems insufficient.

While the old version of this API exists, it would be good for browsers supporting it to help their users get local recognition even on sites that haven't adopted the new options. Could the spec suggest this? Would an omnibox icon be an appropriate UI in your Chromium implementation?

Similarly, there might be cases where a user actively wants the recognition to happen in the cloud, for example if their battery is unusually low, they want improved cloud-based recognition, or (speculating) to use a particular cloud service if there are several options with privacy tradeoffs. The same UI might be appropriate for giving them that control.

Nits

  • In comparing this API to the proposed Translation API, we noticed that the proposed "download it" function here is missing an AbortSignal parameter to let the site abort the download.
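For illustration, the AbortSignal pattern this nit refers to might look like the following (the installOnDevice name and options shape are assumptions; only the AbortController/AbortSignal usage is standard):

```js
// Hypothetical sketch: wiring an AbortSignal into the install/download function so a
// site can stop waiting for a language-pack download it no longer needs.
const controller = new AbortController();

const installPromise = SpeechRecognition.installOnDevice('fr-FR', {
  signal: controller.signal,
});
installPromise.catch(() => { /* aborted or failed; fall back or ignore */ });

// If the user dismisses the captioning UI, abort the pending install request.
document.querySelector('#cancel-captions')
  ?.addEventListener('click', () => controller.abort());
```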
Comment by @domenic Apr 16, 2025 (See Github)

Restricting recognition location

I really appreciate the TAG's nuanced perspective on this. (This is an area we're keeping an eye on for other APIs as well: https://github.com/webmachinelearning/writing-assistance-apis/issues/38.)

One technology that comes to mind is privacy-preserving cloud implementations. I believe we've seen some of these deployed in the private advertising spaces, and you could imagine them being deployed for these sorts of technologies as well.

I don't know the full details of how such technologies would be used here. E.g., is the privacy technologically guaranteed through mechanisms like TPMs, homomorphic encryption, multiple parties in the chain each of which only sees some of the data, etc.? Or the privacy mostly contractual, where the cloud model provider has some strong guarantee that they never look at the data? Do web developers and users have different feelings about these two levels of privacy guarantee?

My immediate concern is to avoid designing APIs which are future-incompatible with such technology. For example, consider a future where we designed an API of the shape { onDeviceOnly: true }, which when used means that the API only works on ~20% of devices (those with enough GPU memory). But then, one browser believes they have a strong-enough private cloud implementation, which would allow device coverage to go up to 100%. Are all sites that have chosen { onDeviceOnly: true } forever locked out of that 100% device coverage? Or do we break the meaning of { onDeviceOnly: true } to also allow private clouds? Neither alternative seems good.

I don't know what the best solution here is, but I think @jyasskin's suggestions of

With those caveats, some of the TAG supports giving the site a way to encourage the browser to keep the speech local. Or, perhaps equivalently, for the site to tell the UA that it's going to keep the audio constrained to as few devices as possible, as a hint that the UA should do the same. This should not be a strong UA requirement, which means that it should not be called ondevice-only.

seem promising.

Comment by @jyasskin Apr 16, 2025 (See Github)

Just noting that even though I posted the comment, its drafting was a team effort. In the suggestion that @domenic praised, at least @martinthomson and @christianliebel were essential to including the note that "keep it local" should just be a hint.

Comment by @evanbliu Apr 18, 2025 (See Github)

We appreciate the TAG's thoughtful and thorough feedback on our proposal to add on-device speech recognition support to the Web Speech API. We discussed this feedback at the monthly Audio Working Group meeting with representatives from Google and Mozilla. Below, we address each of the major concerns raised:

1. Restricting Recognition Location

We accept TAG's recommendation to remove the cloud-only option from the API. This aligns with the current direction of implementation across browsers—Firefox has no plans to support cloud-only recognition, and Chrome is also moving away from it. That said, we acknowledge that cloud-based speech recognition may still be preferred in certain situations, such as:

  • On low-power devices that lack sufficient compute resources.

  • When resource-intensive on-device features are already in use.

  • When cloud recognition offers better quality in specific contexts.

Importantly, we must also support use cases where audio must not be sent to third-party services, such as for regulatory or compliance reasons. While we acknowledge that confidential computing may eventually offer a viable solution for such cases, we do not intend to support that path at this time. Instead, we aim to design the API to be extensible, so that support for confidential computing can be added later if needed.

To accommodate current needs, user agents that cannot guarantee local-only processing may throw an error, allowing websites to make informed decisions based on the level of assurance required.

2. Recognizing Other Users' Speech

We recognize the efficiency argument for centralized recognition (e.g., WebRTC). However, we believe MediaStreamTrack support enables flexibility, allowing:

  • Sender-side recognition, where the originator of speech recognizes and distributes captions.

  • Receiver-side recognition

We're open to extending the installation API to support multiple language packs in a single call (e.g., installOnDevice([lang1, lang2, ...])), improving both user experience and consent clarity. We note, however, that the current API returns a Promise<boolean> indicating installation success. Supporting multiple languages could introduce ambiguity if one language fails. We will explore a more expressive return format to address this.
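One purely illustrative possibility for a more expressive return format (nothing here is specified; the method signature and result shape are assumptions for discussion):

```js
// Illustrative sketch only: a per-language result instead of a single boolean, so a
// partially failed multi-language install is unambiguous.
const results = await SpeechRecognition.installOnDevice(['en-US', 'de-DE']);
// e.g. results === [ { lang: 'en-US', installed: true },
//                    { lang: 'de-DE', installed: false } ]
for (const { lang, installed } of results) {
  if (!installed) {
    console.warn(`Language pack for ${lang} was not installed.`);
  }
}
```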

3. Personalization

We accept the TAG's recommendation and will add language to the specification stating that:

User agents MUST NOT personalize speech recognition.

We agree that this approach avoids fingerprinting and privacy risks while keeping the API simpler and more secure.

4. Fingerprinting

We acknowledge the fingerprinting risks posed by functions like availableOnDevice(). To mitigate these risks:

  • We will align with the Web Translation API’s privacy-preserving approach.

  • Both Chrome and Firefox will support only one language pack per language at a time, reducing the variability that could lead to fingerprinting.

  • Language packs will be cleared when browser storage is cleared, ensuring consistent behavior with other privacy controls.

For Chrome specifically, we will implement the same fingerprinting mitigations used in the Web Translation API. These are detailed in the following document: Fingerprinting Mitigations.

5. UI and User Choice

We acknowledge that some users may prefer cloud-based recognition in certain scenarios—for example, to conserve battery or access higher-quality recognition. In principle, we support giving users control over this choice through appropriate UA-level mechanisms.

However, neither Chrome nor Firefox currently plans to expose explicit UI-level controls for this. Firefox is committed to supporting only on-device speech recognition. Chrome is also moving away from cloud-based recognition and plans to phase it out over time.

The specification will be updated to use the phrasing "may prompt the user" to allow user agents the flexibility to implement privacy-preserving countermeasures in a way that best fits their platform and user base, without prescribing a specific UI. For instance, Chrome will rely on non-UI mechanisms to mitigate fingerprinting risks, rather than requiring explicit user prompts.

6. Nits

We are open to adding an AbortSignal parameter to the download/install function, consistent with modern web platform design.

We note that for Chrome's implementation of the Web Translation API, aborting the request does not cancel the actual download, for privacy-preserving reasons, but rather stops the associated download progress events.

Comment by @jyasskin Apr 21, 2025 (See Github)

Thanks for the reply and initial changes! Here are some initial thoughts from my perspective. These haven't been vetted by the whole TAG yet:

1. Restricting Recognition Location

A cloud server that the UA uses for recognition would be a second-party service, since the user is the second party, and the server operates on their behalf. I still support a hint that the site is acting to reduce the number of machines that could see the audio, which will encourage UAs to do the same, but I don't think it can honestly be called "ondevice-only" and still give UAs the flexibility they need to act on their users' behalf. I recognize that Chrome and Firefox don't have any plans to let users offload this work to cloud services, but we should design the API to accommodate future UAs that might want to explore that direction.

I would support a statement that, at least given this hint, UAs MUST NOT expose the audio to any third parties, in case that helps these concerned partners be more comfortable with the change.

2. Recognizing Other Users' Speech

It's true that the current design enables flexibility, but it's not clear that this is user-serving flexibility. It might be, but I don't see use cases described in the explainer that show that users need the flexibility. "scenarios where personalized or local processing is preferred" should instead describe a few of those scenarios. Note that https://github.com/WebAudio/web-speech-api/pull/150/files specifically bans personalized processing, and local processing is just as possible with sender-side recognition, which reinforces the possibility that there are no such scenarios.

4. Fingerprinting

Note that the Fingerprinting Mitigations document isn't world-readable, so most of the TAG can't see it.

I generally like the direction of the Web Translation API’s approach to privacy, but the TAG hasn't fully analyzed it, so I don't want to say it's definitely enough. I think a list of use cases for the availableOnDevice() query would help the rest of the TAG get more comfortable with the idea.

Does "support only one language pack per language at a time" mean that each browser major version will be pinned to exactly one pack version per language? That's tighter than https://github.com/webmachinelearning/writing-assistance-apis/pull/47 was willing to require, but it solves the extra fingerprinting problem.

Comment by @evanbliu Apr 25, 2025 (See Github)

Thanks for the notes, Jeffrey! I haven't had a chance to meet with the Audio WG yet, but here's my attempt at clarifying some of your questions.

1. Restricting Recognition Location

The recognition.mode = "ondevice-preferred" option is indeed intended to serve as the "hint" that websites can use to express a preference for on-device speech recognition, without the UA guaranteeing it. This allows UAs flexibility, for instance, to use their own second-party cloud services if it benefits the user (e.g., on low-power devices) and the site has only indicated a preference.

However, we maintain the critical need for a mechanism that allows websites to guarantee that audio is not sent to any external service—be it a second-party or third-party service—for processing. The recognition.mode = "ondevice-only" option is designed to serve this specific requirement. If a UA cannot process the audio solely on the device (without any external network transmission of the speech data for recognition purposes), it should then throw an error when this mode is requested. This provides a clear assurance for applications with strict data residency or privacy requirements.
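A rough sketch of how a site with strict requirements might use these modes (the mode values come from the discussion above; how the failure surfaces, and the error name, are assumptions):

```js
// Sketch: request a guarantee that audio never leaves the device. If the UA cannot
// satisfy "ondevice-only", it reports an error rather than silently using the cloud.
const recognition = new SpeechRecognition();
recognition.mode = 'ondevice-only';
recognition.lang = 'en-US';

recognition.onerror = (event) => {
  // Assumed error path: fall back to the site's own compliant pipeline, or disable
  // the captioning feature entirely.
  console.warn('On-device-only recognition unavailable:', event.error);
};

recognition.start();
```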

2. Recognizing Other Users' Speech

The mention of MediaStreamTrack support was in response to the discussion on efficiency and sustainable speech recognition. It enables developers to implement either sender-side or receiver-side captioning. The previous note on personalization was incorrect and has been removed—while local speech recognition can support personalized sender-side captioning when audio is captured via microphone, it does not apply to receiver-side captioning using a MediaStreamTrack.

4. Fingerprinting

Apologies for the inaccessibility of the document. We're unable to share its exact contents at this time, but the PR you linked to provides a thorough explanation of the privacy-preserving countermeasures of the Web Translation API that we're planning to adopt here.

Here are some use cases that availableOnDevice() aims to address:

  • Conditionally Offer Features: Decide whether to present UI elements or enable features that rely on on-device recognition before prompting for installation. This avoids offering a feature that isn't viable.
  • Graceful Degradation/Enhancement: Allow the site to immediately understand if on-device is an option. If not, it can fall back to alternative mechanisms (like cloud-based services it operates, or informing the user about limitations).
  • Resource-Informed UI: For example, a web application might choose to display a "Transcribe Locally" button only if availableOnDevice() indicates potential support.
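To make the last use case above concrete, a minimal sketch (the availableOnDevice signature and its truthy availability result are assumptions):

```js
// Sketch of the "Resource-Informed UI" case: only reveal the "Transcribe Locally"
// button when on-device recognition is (or could be made) available for the language.
async function maybeShowLocalTranscribeButton(lang) {
  const availability = await SpeechRecognition.availableOnDevice(lang);
  document.querySelector('#transcribe-locally').hidden = !availability;
}
```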
Discussed Apr 28, 2025 (See Github)

Jeffrey: we posted a comment .. they came back with some replies .. I went back to them to get some use cases added. I think it would be useful to say which pieces of the reply we're happy with. Also we need to check with Martin and draft a consensus reply comment.

...Restricting recognition location ... they said "google meet would only adopt if guarantee that recognition is on device" but only for performance. I feel that's not a valid reason...

Dan: privacy use cases

Jeffrey: the user agent might be on a low-power phone and ship the recognition to their own desktop device.. but saying "on device only" would ban that and we shouldn't ban that.. Anything in the cloud would be in theory on the user's behalf... even if a cloud server...

Dan: skeptical of claims that "cloud processing" is implicitly agreed to by ...

Jeffrey: the proponents want the webapp to be able to specify that the processing only happens on the device .. and I think that's too restrictive...

Matthew: if you care most about performance you might care about ... devices / cloud... if you care about privacy you might care about devices that are paired... so 3 divisions: local, pretty-local and cloud... Maybe user agent covers cloud / parts of cloud... I think there probably is .. people out there who wouldn't want to send it off network...

Jeffrey: should that be the user's choice or the application's choice?

Matthew: we shouldn't do anything that restricts the UA's choice... if the UA values privacy over performance.

Jeffrey: 2nd issue: question of recognizing other users' speech. If you're in a VC situation, you might have the system recognize your own speech and send that text... or you might have the system recognize all the speech and the 2nd requires more downloads... and also requires more total processing... There is also a better API if you can recognize others' speech. Google meet is again the example web site... they have actually considered both types of captioning... ... imagine you're in a VC and one user wants to turn on captioning, and the devices are a mix some of which don't have recognition... and also I have to tell you I'm putting on captions... I think those justify it... No privacy risk over running it on audio you already have because you already have the audio... Martin may have concerns will run by him.

no concerns

Jeffrey: 3rd issue: personalization: they said "do not personalize" - which is fine.

Jeffrey: 4th issue: fingerprinting: they gave some use cases ... they need some availability indication... for UI considerations. The writing assistance APIs now have a big privacy considerations section. We should review those. but it seems mostly OK to me. Will include that in draft comment. We suggested "let the user choose to use the local thing".. spec doesn't do enough to show the user prompt. They took our suggestions.

Jeffrey: this API doesn't take an abort signal - so we should say it should be consistent with the writing assistance APIs... They are working on that.

Matthew: one more thing ... It seems like there is going to be an increasing number of APIs that run some kind of model .. either on device, or cloud, or non-UA cloud... A consistency opportunity here to come up with a reasonably good selection of domains that would be applicable. Because then ... whether UA is interested in privacy , security etc... ... there are a plethora of UAs coming up these days. wouldn't want to preclude. What can we do as TAG to encourage consistency?

Jeffrey: sounds like a design principle...

Matthew: makes noises about drafting a PR. It's not just performance, it's performance or privacy or some combination thereof...

Jeffrey to draft a comment based on the above.

Comment by @jyasskin Apr 28, 2025 (See Github)

Thanks; I think we have almost enough to re-discuss this. My last remaining questions are:

  1. What's one concrete website that might adopt this API only if it provides the mechanism to guarantee that audio is not sent across the network, even encrypted to another device controlled by the user? Have they described their requirement in public in a place the TAG could read?
  2. What's one website that needs to recognize audio on the receiver side, and can you point us to a place where they've documented how this architecture is better for their users than either having senders recognize the audio or having the cloud server recognize it?

It would be great if the agreed changes, and the concrete use cases for the parts of the current design that the TAG (or others) were concerned about, could be reflected in the explainer as soon as possible.

Thanks.

Comment by @evanbliu Apr 28, 2025 (See Github)

Google Meet is one site that would only adopt this API if it provides a mechanism to guarantee that on-device speech recognition is used, though not for typical data residency reasons. Specifically for Google Meet, the performance of the latest on-device speech recognition language packs, in addition to biasing support, meets their stringent requirements, whereas the Open Speech API that powers the cloud implementation of the Web Speech API in Chrome does not. If on-device speech recognition is not available, Google Meet would continue to use its own cloud speech recognition implementation instead of the Web Speech API.

More commonly, developers have been requesting on-device speech recognition support simply because they do not want to send audio data to Google or other speech recognition services for various reasons. Here are some examples of these requests: https://webwewant.fyi/wants/55/ https://github.com/WebAudio/web-speech-api/issues/108 https://stackoverflow.com/questions/49473369/offline-speech-recognition-in-browser https://www.reddit.com/r/html5/comments/8jtv3u/offline_voice_recognition_without_the_webspeech/

padenot@mozilla.com may have other examples of EU partners with regulatory requirements regarding this.

As for receiver-side captioning in a WebRTC scenario, one primary benefit is that in a large meeting, if only one person requires captions, then only that person has to install and run on-device speech recognition. Otherwise, with sender-side captioning, every single person in the meeting would have to run on-device speech recognition. This document isn't viewable externally, but http://go/receiver-captions (Google only) has some trade-offs for receiver-side captioning specifically for Google Meet.

I'll work on updating the explainer in the next day or so with all of the changes and discussion in this thread!

Discussed May 5, 2025 (See Github)

Posted and closed.