design-reviews#948: Web Translation API

#948: Web Translation API

Opened Apr 24, 2024

こんにちは TAG-さん!

I'm requesting a TAG review of Web Translation API.

Browsers are increasingly offering language translation to their users. Such translation capabilities can also be useful to web developers. This is especially the case when browser's built-in translation abilities cannot help, such as:

translating user input or other interactive features;
pages with complicated DOMs which trip up browser translation;
providing in-page UI to start the translation; or
translating content that is not in the DOM, e.g. spoken content.

To perform translation in such cases, web sites currently have to either call out to cloud APIs, or bring their own translation models and run them using technologies like WebAssembly and WebGPU. This proposal introduces a new JavaScript API for exposing a browser's existing language translation abilities to web pages, so that if present, they can serve as a simpler and less resource-intensive alternative.

Explainer (minimally containing user needs and example code): https://github.com/WICG/translation-api
User research: none
Security and Privacy self-review: https://github.com/WICG/translation-api/blob/main/security-privacy-questionnaire.md
GitHub repo (if you prefer feedback filed there): https://github.com/WICG/translation-api
Primary contacts (and their relationship to the specification):
- Domenic Denicola (@domenic), Google, spec-writer
Organization/project driving the design: Google Chrome
External status/issue trackers for this feature (publicly visible, e.g. Chrome Status): https://chromestatus.com/feature/5172811302961152

Further details:

I have reviewed the TAG's Web Platform Design Principles
The group where the incubation/design work on this is being done (or is intended to be done in the future): WICG
The group where standardization of this work is intended to be done ("unknown" if not known): unknown. The Web Machine Learning W3C Working Group seems to explicitly rule out these sorts of high-level APIs in its charter. Maybe that can be amended though.
Existing major pieces of multi-stakeholder review or discussion of this design:
Major unresolved issues with or opposition to this design:
- It's unclear whether including translation from an unknown source language is worth including, or whether we should require web developers to do a two-step detect-language + translate language process. Explainer discussion, https://github.com/WICG/translation-api/issues/1
- The exact way in which language tags should be handled, especially to give good interoperability, is still unclear. Explainer discussion, https://github.com/WICG/translation-api/issues/2
- Naive implementations that give the site direct access to info about which language packs are downloaded can provide information that fingerprints the user. Various mitigations are possible but it's unclear which strike the best balance or are most feasible. Explainer discussion, https://github.com/WICG/translation-api/issues/3
- We might want to expose some information about whether the translation is on-device or not, and about the model version. But this could also be harmful for interop. See explainer discussion at the bottom of the "Goals" section.
This work is being funded by: Google

Discussions

Comment by @shivaylamba Apr 24, 2024 (See Github)

I like this idea. And especially giving the capability to the user to choose between either using on-device or cloud based models. Given how performant and efficient on-device AI models have become using WASM/WebGPU, we can definitely vouch for the translations to take place while also minimizing risk of any data being sent to cloud.

Comment by @jasonmayes Apr 26, 2024 (See Github)

I think having an auto detect option could be really cool. Scenario: Imagine you are in a Google Meet video call and everyone is from a different country. This way, you could make a Chrome Extension that auto detects the language and is able to translate for each person speaking even if language changes between detection sessions. That would be super useful.

It would be good if you are thinking of hyrbid approach for the programmer to explicitly require offline on device model inference here in case there are privacy aspects they need to adhere to for their application. I think 3 options are useful:

client side (forces on device only - if cant run for some reason just fails with error to capture)
hybrid (browser chooses based on network conditions / device capabilities) and uses the right one in the moment
server side (forces server side only)

Comment by @shivaylamba Apr 26, 2024 (See Github)

I think having an auto detect option could be really cool. Scenario: Imagine you are in a Google Meet video call and everyone is from a different country. This way, you could make a Chrome Extension that auto detects the language and is able to translate for each person speaking even if language changes between detection sessions. That would be super useful.

It would be good if you are thinking of hyrbid approach for the programmer to explicitly require offline on device model inference here in case there are privacy aspects they need to adhere to for their application. I think 3 options are useful:

client side (forces on device only - if cant run for some reason just fails with error to capture)

hybrid (browser chooses based on network conditions / device capabilities) and uses the right one in the moment

server side (forces server side only)

I think we can also recommend the user based on their hardware, the type of model inference to be used (cloud vs on device). For instance, if the user's hardware doesn't have a GPU and has limited RAM, it might be better suited for cloud inference. Let me know what do you think @jasonmayes

Discussed May 13, 2024 (See Github)

we didn't get to this one

Discussed May 13, 2024 (See Github)

This will come up in Breakout C later.

Matthew: concern about how users will understand the use of automated translation - will it be obvious that it's automated to users? So they know there are risks. Also don't want to encourage the reapeated use of ML instead of professionally done translations, but understand that is not always possible. (Similar example: captions on videos for accessibility - helpful, but the ideal is always going to be to prodcue captions, rather than have ML work on them, for accuracy, inclusion, and sustainability.)

Max: If the device doesn't have the requisite hardware (e.g. GPU) the UX could be downgraded - important to consider this.

Discussed Jul 29, 2024 (See Github)

Matthew: we need Lea's input on API shape... however I have several thoughts...

First one is they did talk about a potential extension - running on device only or going to the cloud. That could be important. I think that's important. They haven't done research with developers - would be good for them to see what developers think. It strikes me in some use cases that would need to be controlled.
Don't know if they've talked with I18N WG - specifically about their issue 11 on specificity - BCP codes, etc... I18N would have thoughts on that. Also streaming input support.
They're namespacing all the functions under ai
I have a general concern around efficiency and accuracy. Concerned that developers would rely on this rather than doing translation for static assets as part of their development process. Some non-normative encouragement to not do that maybe?
Donwload progress reporting ... downloading the language support .. they think the web app should display the progress for that. Shouldn't that be part of the UA's UI?
Also the API used signals. Using signals to see if download has been aborted, etc...

Peter: I think the signal - abort signal - is ok.

Lea: yes.

Dan: cloud vs on device...

Matthew: at the moment it's how the browser wants to implement it. so it could be cloud based or on-device.

Lea: the author should be able to say "if you can translate this without going out to the network, fine"

Tess: it's about fetching language models... The translation still might be fast or slow...

Matthew: i think in some cases they are talking about it going out to the cloud to do the translation... I think we need to check up on that. So we could have: fast-local; slow-local; or remote.

Tess: the slow-local case could also be a privacy issue...

Lea: what's to prevent a web page from iterating over all the languages?

Tess: throttling in the implementaiton...

Lea: also to be clear: the need is there and I'm excited to see this. would be nice if there were an actual list of use cases...

Dan: UI elements or content or both?

Lea: both...

Why a whole different object to construct instances? A static method on ATTranslator would also have the advantage of making it clear what objects are constructed. Same for capabilities()
I wonder if it would make sense to have a single object, rather than a language detector and a translator, as the functionality seems quite related and often language detection is the first step.

We agree with and support the user need. Here are our thoughts...

It would be good to have a list of use cases. We could think of some from our own experience, but they may be different than the ones you had in mind. Having an explicit list of use cases ensures that everyone is on the same page.
Please continue chatting with the I18N WG folks about issue 11, and streaming input support.
We're concerned about the use of the network. Specifically, use of the network to download a model, or use of the network to actually perform the translation, could introduce both delay and privacy issues. Is it possible for the developer to specify: "only do this if network access is not required"? We feel that differentiation between fast-local, slow-local (i.e. with downlaod), and remote/cloud-based cases is important for MVP.
We loved the approach you propose to partitioning, and using a fake download, to mitigate fingerprinting!
We recommend a translation-specific namespace instead of ai.
Why is a separate namespace needed at all? We understand these objects are not constructible due to the asynchronicity, but since they are creating instances of the same class, making this obvious by adding the factory as a static method of this class seems more consistent with precedent. Same for the capabilities() method, we don't understand why this needs to live in a different namespace, and we think that the more objects this API is spread across, the harder it will be for authors to understand how the different parts fit together.
We think there should be a prominent note encouraging developers to make use of professional translation of pre-existing content rather than doing automatic translation wherever they can - for both accuracy and efficiency reasons.
It seems to make more sense, and help simplify the API and alleviate some privacy concerns if the UA renders the download progress bar.
We did wonder if it would make sense to have a single object for the detection and translation, since they are so related (and often detection is the first step to translation). Was this direction explored?

</blockquote>

Comment by @matatk Jul 29, 2024 (See Github)

We agree with and support the user need. Here are our thoughts...

It would be good to have a list of use cases. We could think of some from our own experience, but they may be different than the ones you had in mind. Having an explicit list of use cases ensures that everyone is on the same page.
Please continue chatting with the I18N WG folks about issue 11, and streaming input support.
We're concerned about the use of the network. Specifically, use of the network to download a model, or use of the network to actually perform the translation, could introduce both delay and privacy issues. Is it possible for the developer to specify: "only do this if network access is not required"? We feel that differentiation between fast-local, slow-local (i.e. with downlaod), and remote/cloud-based cases is important for MVP.
We loved the approach you propose to partitioning, and using a fake download, to mitigate fingerprinting!
We recommend a translation-specific namespace instead of ai.
Why is a separate namespace needed at all? We understand these objects are not constructible due to the asynchronicity, but since they are creating instances of the same class, making this obvious by adding the factory as a static method of this class seems more consistent with precedent. Same for the capabilities() method, we don't understand why this needs to live in a different namespace, and we think that the more objects this API is spread across, the harder it will be for authors to understand how the different parts fit together.
We think there should be a prominent note encouraging developers to make use of professional translation of pre-existing content rather than doing automatic translation wherever they can - for both accuracy and efficiency reasons.
It seems to make more sense, and help simplify the API and alleviate some privacy concerns if the UA renders the download progress bar.
We did wonder if it would make sense to have a single object for the detection and translation, since they are so related (and often detection is the first step to translation). Was this direction explored?

Comment by @domenic Jul 30, 2024 (See Github)

Thanks for the review!

It would be good to have a list of use cases. We could think of some from our own experience, but they may be different than the ones you had in mind. Having an explicit list of use cases ensures that everyone is on the same page.

I believe these are listed in the first paragraph of the explainer. https://github.com/WICG/translation-api/blob/main/README.md#explainer-for-the-web-translation-and-language-detection-apis

3. We're concerned about the use of the network. Specifically, use of the network to download a model, or use of the network to actually perform the translation, could introduce both delay and privacy issues. Is it possible for the developer to specify: "only do this if network access is not required"? We feel that differentiation between fast-local, slow-local (i.e. with downlaod), and remote/cloud-based cases is important for MVP.

It is possible for the developer to avoid downloading the model, if the browser intends to support on-device translation, by checking if capabilities.available is "readily" (as opposed to "after-download").

We haven't yet exposed whether the translation is done entirely on-device or through cloud services, because doing so could possibly cause developers to write code that excludes certain browsers. But, we understand this could be worthwhile. This is mentioned in https://github.com/WICG/translation-api/blob/main/README.md#goals . We'll closely monitor this space, to find out if there are developers who need this ability, and/or whether any browsers actually plan to implement using cloud services.

4. We loved the approach you propose to partitioning, and using a fake download, to mitigate fingerprinting!

Thanks for the kind words, although at least the fake downloads idea isn't looking too promising at the moment. https://github.com/WICG/translation-api/issues/10

5. We recommend a translation-specific namespace instead of ai.

This is related to some ongoing work on other AI model-based APIs which are not yet at the stage of being ready for TAG review. We want them all to share a namespace and a set of common API patterns (e.g. sibling create() and capabilities() methods; "no"/"readily"/"after-download" availability signals; destroy() methods; specific AbortSignal integration patterns; etc.)

I understand it can be hard to judge this in the absence of other reviewable explainers, so we can revisit this later when we make more progress on those. Stay tuned!!

6. Why is a separate namespace needed at all? We understand these objects are not constructible due to the asynchronicity, but since they are creating instances of the same class, making this obvious by adding the factory as a static method of this class seems more consistent with precedent. Same for the capabilities() method, we don't understand why this needs to live in a different namespace, and we think that the more objects this API is spread across, the harder it will be for authors to understand how the different parts fit together.

I thought about this avenue as well.

First, to clarify, we do need a separate capabilities() method so that web developers can determine model capabilities without initiating a create operation. (Which can be expensive, both in bandwidth and in GPU memory.) So we cannot merge that into the translator object. And, this method needs to be asynchronous, as the source of truth for the capabilities information will not generally be in the same process as the WindowOrWorkerGlobalScope. (We could proactively load the capabilities information into every WindowOrWorkerGlobalScope, but that would cause all sites, including those not using these APIs, to pay the cost. Which is undesirable.)

So I think what you're suggesting ends up converting the API from something like

const capabilities = await ai.translator.capabilities();
const translator = await ai.translator.create();

to something like

const capabilities = await AITranslatorCapabilities.get();
const translator = await AITranslator.create();

I think this is a viable direction. A bit uglier in my opinion, but if the goal is to minimize the number of namespaces, then it does work. We can keep it as a possibility, and see which web developers prefer, or if other arguments appear on either side.

8. It seems to make more sense, and help simplify the API and alleviate some privacy concerns if the UA renders the download progress bar.

The exact UI signals for when these APIs are in use is definitely worth exploring. Browser UI teams are not always excited about adding "noise" to what the user sees, but if we end up needing a permission prompt or similar anyway for privacy reasons, maybe we could convince them to add in some progress measures.

9. We did wonder if it would make sense to have a single object for the detection and translation, since they are so related (and often detection is the first step to translation). Was this direction explored?

To some extent yes. Before https://github.com/WICG/translation-api/commit/2cb6637e6584c9b1f43d49309a8a395bd9b927e7 the APIs were more tighly coupled, both existing on a self.translation API. We still had separate detector and translator objects, though. This seems necessary, because a translator has a specific source/target language pair associated with it, and a detector does not.

We separated the APIs even more once we looked into the possible implementation strategies. It turns out that language detector models and translation models are generally quite different. And we wanted to allow browsers to take advantage of these differences, instead of forcing them to unify to a lowest-common-denominator, or expose strange inconsistencies to web developers.

For example, you can find small off-the-shelf language detector models supporting over 80 languages. (If I am reading this MDN page correctly correctly, both Firefox and Chrome use such a model for the Web Extensions i18n.detectLanguage() API.) But, for example, Firefox's language translation models support 10 languages. In our previous design, we had a single supportedLanguages() method, which doesn't make sense given such a setup.

Comment by @jasonmayes Jul 30, 2024 (See Github)

Just sharing an example I was tagged in recently showing how JS users are using Web AI already to do real-time translation in browser entirely client side - may be useful to see use cases by real existing users that could help shape discussion here: https://twitter.com/thorwebdev/status/1816745831225290909

Comment by @plinss Jul 30, 2024 (See Github)

So I think what you're suggesting ends up converting the API from something like

const capabilities = await ai.translator.capabilities(); const translator = await ai.translator.create();

to something like

const capabilities = await AITranslatorCapabilities.get(); const translator = await AITranslator.create();

Capabilities could simply be a static method on the translator, no?

const capabilities = await Translator.capabilities();

Comment by @domenic Jul 30, 2024 (See Github)

Capabilities could simply be a static method on the translator, no?

Sure, either way, although that's less symmetric than having each class vend its own instances, and takes us back toward kinda using classes as namespaces (just this time with static methods).

Comment by @martinthomson Jul 30, 2024 (See Github)

We recommend a translation-specific namespace instead of ai.

This is related to some ongoing work on other AI model-based APIs which are not yet at the stage of being ready for TAG review. We want them all to share a namespace and a set of common API patterns (e.g. sibling create() and capabilities() methods; "no"/"readily"/"after-download" availability signals; destroy() methods; specific AbortSignal integration patterns; etc.)

One wonders if, in the future, this will be as meaningless as an API called computer or software. Please reconsider.

Comment by @LeaVerou Jul 31, 2024 (See Github)

Capabilities could simply be a static method on the translator, no?

Sure, either way, although that's less symmetric than having each class vend its own instances, and takes us back toward kinda using classes as namespaces (just this time with static methods).

IMO It’s about what makes more sense in terms of entity-relationships. The human mental model is that we’re querying the capabilities of the translator; creating a TranslatorCapabilities object ¹ is not a natural language construct.

Which actually makes me wonder if we need this object at all. Why not simply an async getter and an async function on Translator?

as @martinthomson said, please reconsider using "AI" as part of the name for any of these. ↩

Comment by @plinss Jul 31, 2024 (See Github)

IMO It’s about what makes more sense in terms of entity-relationships. The human mental model is that we’re querying the capabilities of the translator; creating a TranslatorCapabilities object is not a natural language construct.

Agreed, if authors care about getting the capabilities of the translator, querying the translator directly makes more sense. Having each class vend its own instances is an example of putting theoretical purity of an arbitrary design pattern over user needs

Comment by @domenic Jul 31, 2024 (See Github)

Which actually makes me wonder if we need this object at all. Why not simply an async getter and an async function on Translator?

What type would that async getter/function return?

Discussed Aug 5, 2024 (See Github)

Hi Domenic - Our feedback that we sent above stands, particularly regarding the name space of this API. We don't think it belongs in a .ai namespace. We think that the name spaces should be funciton-led rather than technology-used-to-deliver-the-function-led. Your proposal under point 6 (AITranslatorCapabilities) is a step forward but we would prefer Translator.capabilities(). We also think your response to point 3 of our feedback is fine. We also would strongly encourage that the spec include a non-normative note encouraging the use of professional translation services where possible (point 7 of our original feedback). Please consider this. We're going to close as satisfied with concerns due to these concerns. Given the current state of the spec, and lack of venue, we understand this to be an early review. We expect that addressing the points above will be important for further standardization. ✨

</blockquote>

Issue closed with sparkles.

Discussed Aug 5, 2024 (See Github)

Matthew: some additional feedback I've been catching up with... especially about the AI namespace... But there is other stuff they want to put in this AI namespace... the concern is : if they're going to put something in a namespace shouldn't it be problem oriented rather than technology used to solve the problem...

Dan: I'm not anti-LLMs in the browser. Say a browser used a different technology besides AI based to do translation? Then it wouldn't make sense anymore?

Peter: if they want to expose an LLM at a low level then call it an LLM...

Peter: if they are making a new common API pattern then let's make sure it makes sense for everyone... and put it in our Design Principles document...

Dan: propose we say something like

</blockquote>

Matthew: a little more discussion to go... I'm really concerned about point 7 from our original feedback. But I'd like to see if we have consensus about this. It's related to what I've seen in Accessibility. The one thing we can do is set the tone for the discussion here... W3C is also working on this AI and the web document...

Dan: I agree.

Dan: should we bring up the AI and web document in this dicussion?

Matthew: they have been requesting feedback since the AC meeting...

Matthew: we should keep an eye on it and see if Lea can respond...

Comment by @torgo Aug 7, 2024 (See Github)

#948: Web Translation API

Discussions

Footnotes