#591: Handwriting Recognition API


Opened Dec 17, 2020

HIQaH! QaH! TAG!

I'm requesting a TAG review of Handwriting Recognition API.

Handwriting is a widely used input method; one key use case is recognizing text as users draw. This capability already exists on many operating systems (e.g. handwriting input methods), but the web platform doesn't have it today: developers need to integrate third-party libraries (or cloud services), or build native apps instead.

We want to add handwriting recognition capability to the web platform, so developers can use the existing handwriting recognition features available on the operating system.
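For context, the explainer at the time of this review sketches roughly the following flow (`createHandwritingRecognizer`, `startDrawing`, `addStroke`, `getPrediction`). Since the real API is browser-only, the snippet below substitutes a minimal mock for `navigator` so the flow can be followed end to end; the mock's behavior is purely illustrative, not the real recognizer's:

```javascript
// Illustrative mock of the proposed API surface. The real objects would live
// on the browser's navigator; this stand-in only echoes the shape of the flow.
const navigatorMock = {
  async createHandwritingRecognizer({ languages }) {
    const strokes = [];
    return {
      startDrawing() {
        return {
          addStroke(stroke) { strokes.push(stroke); },
          async getPrediction() {
            // A real recognizer would return candidate texts for the drawing.
            return [{ text: `[${strokes.length} stroke(s), lang=${languages[0]}]` }];
          },
        };
      },
    };
  },
};

async function recognizeDemo() {
  const recognizer = await navigatorMock.createHandwritingRecognizer({ languages: ['en'] });
  const drawing = recognizer.startDrawing();
  // A stroke is a series of (x, y, t) points, as with HandwritingStroke.addPoint().
  drawing.addStroke({ points: [{ x: 0, y: 0, t: 0 }, { x: 12, y: 4, t: 16 }] });
  const [best] = await drawing.getPrediction();
  return best.text;
}
```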

Further details:

  • I have reviewed the TAG's API Design Principles
  • The group where the incubation/design work on this is being done (or is intended to be done in the future): WICG
  • The group where standardization of this work is intended to be done: unknown
  • Existing major pieces of multi-stakeholder review or discussion of this design: GitHub issues
  • Major unresolved issues with or opposition to this design:
    • Complex script text segmentation: issue 6
  • This work is being funded by: Google

We'd prefer the TAG provide feedback as (please delete all but the desired option):

🐛 open issues in our GitHub repo for each point of feedback

Thanks.

Discussions

Discussed Jan 1, 2021 (See Github)

Privacy concerns? Supposedly it doesn't collect new information about the user. Drawing data can already be collected with the canvas.

Sangwhan: why is it on the navigator?

Yves: about i18n - how do they recognise vertical vs horizontal, mixed ltr and rtl?

Sangwhan: guessing they delegate it off to a native library? It's unclear

Yves: they have the basic order of strokes, but wondering about the result of the API? Text, or text with information about orientation?

Sangwhan: when you create a handwriting recognizer you pass the language you want to recognize

Yves: for japanese you can write it horizontally or vertically

Sangwhan: I don't think I've seen implementations supporting vertical on a digitizer, but theoretically people might

Yves: prediction result is a JS object containing the strings.. one string without i18n information is probably missing something

Sangwhan: how do you deal with codeswitching?

Yves: right, you have an EU language with Arabic mixed in. Multiple strings with different information - how do they handle that? Should we make them get a review from the i18n group?

Sangwhan: what is distance measured in? Pixels?

Hadley: would guess pixels but they should make that explicit

Yves: physical or logical pixels?

Sangwhan: and how does it tie in with the API? Will ask.

Sangwhan: you can collect more information than with the canvas on a dedicated writing surface. People would treat this as a handwriting input method rather than a scribbleboard, so it can probably be used to collect more information - not at the API level, but from how users interact with it. I'm not convinced it operates on the same magnitude as canvas: I don't recall ever writing a full sentence on a canvas surface, but you'd be much more likely to write a full sentence here.

Yves: tablets are the main use of handwriting recognition

Amy: could it be repurposed to collect signatures, even though they don't need to be recognised?

Sangwhan: and why do they want this exposed to the web? Shouldn't it be an IME?

Comment by @cynthia Jan 26, 2021 (See Github)

We briefly looked at this during our F2F today, and had a couple of early review questions:

  1. How does this work when it comes to writing vertically or right-to-left?
  2. Same question, but in the context of code-switching? (e.g. English word in between Arabic text?)
  3. Does it make sense for this to be stuck directly onto navigator? Could you let us know why it is there?

@r12a do you have any input on this?

Comment by @cynthia Jan 26, 2021 (See Github)

What's the metric of the cartesians in the explainer? Are they physical pixels, logical pixels, or something else?

Comment by @r12a Jan 26, 2021 (See Github)

@cynthia https://github.com/WICG/handwriting-recognition/issues/4

I also made a bunch of other i18n-related comments (see the issue list).

Comment by @wacky6 Jan 28, 2021 (See Github)

Vertical writing.

Here I assume you mean a language that can be written both horizontally and vertically.

Google's recognizer generally returns characters in the order they were written (for such languages), so it works in any writing direction (e.g. RTL, LTR, top-to-bottom). Our metrics show vertical writing isn't commonly used by our users, so this feature hasn't received much recent attention.

We aren't sure how other recognizers work. Some may only work in one direction (and not work at all for vertical writing); some may ignore the order in which characters were written.

What do you think about adding a hint for the writing direction, in case some recognizers need this information? Note that some recognizers may disregard the hint altogether.

RTL writing

For RTL languages, the recognizer already knows it should process text from right to left.

For LTR languages with characters written from right to left (e.g. "hello" written in "olleh" order): this is a rare/uncommon scenario, and I'm not sure what the correct interpretation is. The user probably wants the text to be interpreted as "hello", but it's really up to the recognizer to decide what it outputs. Either output can be considered valid IMO.


Mixed scripts.

The recognizer could determine the writing direction by looking at when each character was written and the characters' spatial relations. The same applies to code-switching.

For example,

  • Unidirectional text "ABC". The writing direction can be learned by looking at the order of each character (A->B->C or C->B->A).
  • Mixed: "AB cba CD" (upper-case / lower-case are two different scripts), "A->B->C->D->a->b", or, "A->B->C->D->b->a".
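As a hypothetical sketch (not part of the proposal), inferring the dominant direction from character positions in written order, as in the "ABC" example above, might look like:

```javascript
// Hypothetical sketch: infer the dominant writing direction from the
// x-coordinates of characters, listed in the order they were written.
function inferDirection(charXs) {
  let ltr = 0, rtl = 0;
  for (let i = 1; i < charXs.length; i++) {
    if (charXs[i] > charXs[i - 1]) ltr++;
    else if (charXs[i] < charXs[i - 1]) rtl++;
  }
  return ltr >= rtl ? 'ltr' : 'rtl';
}

// "ABC" written A->B->C (left to right) vs. C->B->A (right to left):
console.log(inferDirection([0, 10, 20])); // 'ltr'
console.log(inferDirection([20, 10, 0])); // 'rtl'
```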

That said, existing recognizers (those available on the market) don't support mixed scripts (e.g. English + Arabic). They will recognize text as if it were written in a single script (e.g. recognize Arabic characters as English characters, and give less than ideal results).

I don't think we should try to solve the mixed-script problem if the underlying implementations haven't solved it; our solution may not work for them. Or, if an implementation is advanced enough, it won't care whether we provide this information / hint.


Why navigator object

We chose navigator because it's preferred over the alternatives (e.g. window, a global constructor):

We expect the handwriting recognizer to interact with platform-specific APIs and to support different features on different platforms. Navigator seems natural given these feature differences.

We don't have a particular preference on where the methods live. Are you suggesting we put the methods behind an attribute (e.g. navigator.handwritingService.doSomething())?


What's the metric of the cartesians in the explainer

The explainer examples use logical pixels.

The recognizer doesn't particularly care about the measurement unit, as long as all provided coordinates are measured in the same way (i.e. don't mix logical pixels and device pixels).

The recognizer implementation normalizes the coordinates and performs recognition in relative terms (e.g. relative to the smallest character / block in the drawing).
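That normalization step might be sketched as follows (illustrative only; the actual implementation is not specified). Once every coordinate is rescaled to the drawing's own bounding box, it no longer matters whether the inputs were logical or device pixels, as long as they weren't mixed:

```javascript
// Illustrative sketch: normalize stroke coordinates to a unit bounding box,
// making recognition scale- and unit-independent. Each stroke is an array
// of {x, y, t} points; timestamps are passed through unchanged.
function normalizeStrokes(strokes) {
  const pts = strokes.flat();
  const xs = pts.map(p => p.x), ys = pts.map(p => p.y);
  const minX = Math.min(...xs), minY = Math.min(...ys);
  // Use the larger extent so aspect ratio is preserved; guard against
  // a degenerate single-point drawing.
  const scale = Math.max(Math.max(...xs) - minX, Math.max(...ys) - minY) || 1;
  return strokes.map(stroke =>
    stroke.map(p => ({ x: (p.x - minX) / scale, y: (p.y - minY) / scale, t: p.t }))
  );
}
```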

Discussed Feb 15, 2021 (See Github)

Skipped.

Discussed May 1, 2021 (See Github)

Proposed comment:

During our May 2021 vf2f, @cynthia and I did another pass at this review - thank you for all of the answers. Regarding adding a direction hint to the recognizer: we found that to be a useful futureproofing feature and recommend that you add it. After going over the privacy & security questionnaire I am still not clear whether the API exposes additional fingerprinting capabilities. With the exposure of strokes, their ordering and their timing, I worry that models can be trained to easily recognize patterns associated with various disabilities. That would be a very unfortunate byproduct of this API. Is this something that you considered, and could you expand on it?

Comment by @cynthia May 11, 2021 (See Github)

@wacky6, thank you for your patience! @atanassov and I looked at this during our F2F. Your response covers most of the questions we had - thanks a lot.

What do you think about adding a hint for the writing direction, in case some recognizers need this information? Note that some recognizers may disregard the hint altogether.

I think having that as an extension point would be useful - if there is some sort of specific post-processing that needs to be done based on this before it hits the recognizer, it feels like this information could be useful to expose.

For LTR languages with characters written from right to left (e.g. "hello" written in "olleh" order): this is a rare/uncommon scenario.

We agree that this isn't an important scenario to handle. Our concerns about RTL were mostly about languages that are actually written left to right.

I don't think we should try to solve the mixed-script problem if the underlying implementations haven't solved it.

If it's an unsolved problem, I think we don't need to delve too much into this.

We don't have a particular preference on where the methods live. Are you suggesting we put the methods behind an attribute (e.g. navigator.handwritingService.doSomething())?

Yes, this was one of the reasons we asked this question.

We were also a bit curious about what happens when three different tabs initiate multiple recognition contexts - is anything shared? (This question is based on the navigator layering.)

(More comments based on the discussion with @atanassov to come in a bit.)

Comment by @atanassov May 12, 2021 (See Github)

During our May 2021 vf2f, @cynthia and I did another pass at this review - thank you for all of the answers.

Regarding adding a direction hint to the recognizer: we found that to be a useful futureproofing feature and recommend that you add it.

After going over the privacy & security questionnaire I am still not clear whether the API exposes additional fingerprinting capabilities. With the exposure of strokes, their ordering and their timing, I worry that models can be trained to easily recognize patterns associated with various disabilities. That would be a very unfortunate byproduct of this API. Is this something that you considered, and could you expand on it?

Comment by @wacky6 May 13, 2021 (See Github)

Hi, @cynthia

Writing direction

We'll add a direction hint to indicate the expected parsing / reading direction.

This way, a recognizer for "en" and "ar" can differentiate between the following two outputs for the text "نشاط التدويل، W3C":

  • JS String: W3C, [Arabic characters]
  • JS String: [Arabic characters], W3C

Could you confirm this addresses your concern?

Navigator

Recognition contexts are isolated per recognizer; different tabs don't share a recognition context, but they may use the same recognition service (e.g. process) on the OS.

Are there any documents on navigator layering?


Hi, @atanassov

The handwriting process looks like this: User input --(1)--> Stroke Data --(2)--> Text

Websites can already collect handwriting and analyze it. All they need is some user input (step 1) and some analysis code. For example: ask the user to draw on a canvas, use PointerEvent to collect the drawing, then send everything to a server for analysis (they don't have to use our API).

Our API is at step 2. It converts stroke data (represented by our proposed HandwritingStroke and HandwritingDrawing) into text.

Websites can already analyze handwriting in JavaScript. Our API makes this easier (call a method instead of supplying a bunch of JavaScript code) and more efficient (it runs in native code / on accelerators). In short, our API doesn't introduce anything the Web can't already do.
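Step (1) of this pipeline - what a page can already do without the proposed API - might be sketched like this, using plain `{type, x, y, timeStamp}` objects as stand-ins for browser PointerEvents (the function name is illustrative, not part of any spec):

```javascript
// Sketch of step (1): group pointer events into strokes of {x, y, t} points,
// the kind of data step (2) (a recognizer) would consume. A stroke starts on
// pointerdown, collects pointermove points, and ends on pointerup.
function eventsToStrokes(events) {
  const strokes = [];
  let current = null;
  for (const ev of events) {
    if (ev.type === 'pointerdown') current = { points: [] };
    if (current && (ev.type === 'pointerdown' || ev.type === 'pointermove')) {
      current.points.push({ x: ev.x, y: ev.y, t: ev.timeStamp });
    }
    if (ev.type === 'pointerup' && current) {
      strokes.push(current);
      current = null;
    }
  }
  return strokes;
}
```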

Comment by @cynthia Aug 31, 2021 (See Github)

Thank you for your feedback. We've discussed this in a breakout and concluded that this proposal is good to move forward - we'll discuss further in the plenary and close if everyone agrees. Thanks for bringing this to our attention.

As for the navigator layering, we don't have any formal recommendations - we'll discuss this in the plenary and provide feedback afterwards.

Comment by @r12a Aug 31, 2021 (See Github)

That said, existing recognizers (those available on the market) don't support mixed scripts (e.g. English + Arabic). They will recognize text as if it were written in a single script (e.g. recognize Arabic characters as English characters, and give less than ideal results).

I don't think we should try to solve the mixed-script problem if the underlying implementations haven't solved it; our solution may not work for them. Or, if an implementation is advanced enough, it won't care whether we provide this information / hint.

Sure, text written in English rarely has Arabic text in it, but that's not true at all the other way around. Text written in Arabic and all the other languages that use RTL scripts will contain LTR Latin script text on a regular basis. Not only that, but they will also contain numbers, and those are written LTR within the RTL flow. Same goes for expressions, numeric ranges, etc. for some languages. For example, in Hebrew you'll write "Score: 82" as

[Screenshot: "Score: 82" written in Hebrew, with the digits "82" running left to right within the right-to-left text]

I don't think you'd want the text stored in memory to become "Score: 28".

Or how about: "No parking: 08:00 - 20:00". Will the text stored in memory indicate that you can't park during the day, or overnight? It depends on the direction in which the range is read, and that will depend on the rules of the language being used.

Note that https://github.com/WICG/handwriting-recognition/issues/4 already raises some of these issues, but as yet has no response.

Sorry, but I don't buy the idea that you don't have to consider how this would work just because implementations don't currently enable handwriting recognition properly for large percentages of the people on the planet. Our mission is to make the World Wide Web accessible worldwide. I think some thought has to be given to how to address the needs of the currently underserved millions of potential users.

Comment by @r12a Aug 31, 2021 (See Github)

Of course, if the recogniser recognises strokes and stores characters in the order they are written, that may provide a solution, because someone writing "Score: 82" in Hebrew will write the 8 before the 2 (leaving a gap for it to fit). If the conversion of strokes to characters takes place only after the input is completed, however, then mixed-direction text will require parsing for direction changes.

Note, however, that in the former case, where strokes are converted on the fly, it's not straightforward either, since Arabic and Hebrew graphemes tend to be only half-written during the initial pass, and those graphemes are completed after the word is finished (e.g. the top bar in scripts such as Devanagari).

Discussed Sep 1, 2021 (See Github)

Sangwhan: this was proposed to be closed a while ago, and then Richard came in and wrote some comments

Dan: it keeps getting updated which is good

Sangwhan: they can discuss with Richard on the issue but from our end it's okay.. the problem is rtl and ltr switching.. e.g. writing arabic and then numbers and back to arabic.. becomes icky with handwriting recognition. I don't know of an engine that does this well. The API design doesn't factor this in; it's based on the current state of affairs

Dan: I don't think we need to weigh in on that. One of the things that Rossen said in May was that it's not clear about fingerprinting.. I would note that there's a very comprehensive privacy considerations section in the explainer which talks about the different fingerprinting risks associated with different types of recognisers. It looks like they've spent some time thinking about this, which is good to see. From that perspective they've taken some feedback from the TAG into account.

Yves: the rtl ltr discussion started during the review but it's still continuing. I don't think we should close; we should wait for that discussion to complete. And it's good that Richard came up with very specific examples.

Amy: shouldn't that conversation be in the i18n horizontal review ticket?

Sangwhan: that's issue 4 on their end with details and they're discussing there so that's fine. Can continue over there. From a design perspective I think we're okay.

Yves: as long as it's tracked somewhere

Sangwhan: it is tracked and linked to our issue

Dan: I think we should close it. [writes comment]. Multistakeholder?

Sangwhan: apple might implement it..

Dan: no signal no signal.. web developers like it

Sangwhan: a challenge - to ship such a feature you need a handwriting recognition engine, and not many companies have one. Mozilla doesn't have one; they'd have to duct-tape some open source thing into the browser to implement this. There's a bit of an "only rich companies can implement this" problem, and I'm not sure if that should be factored into a design review.

Thanks for the very comprehensive privacy & security section in the explainer. We're basically fine with the design. Since this relies on the presence of a handwriting recognizer software component, there are some concerns about implementability - especially across lower-spec devices and in open source efforts. There also seems to be an issue regarding multi-stakeholder support, as there's no documented support from other browser engines on Chrome Status - can you provide any feedback there? What is the trajectory for this spec after incubation in WICG? Where do you see this going?

Dan: comment left

Comment by @torgo Sep 16, 2021 (See Github)

Thanks for the very comprehensive privacy & security section in the explainer. We're basically fine with the design. Since this relies on the presence of a handwriting recognizer software component, there are some concerns about implementability - especially across lower-spec devices and in open source efforts. There also seems to be an issue regarding multi-stakeholder support, as there's no documented support from other browser engines on Chrome Status - can you provide any feedback there? What is the trajectory for this spec after incubation in WICG? Where do you see this going?

Comment by @tomayac Sep 16, 2021 (See Github)

WebKit (https://lists.webkit.org/pipermail/webkit-dev/2021-March/031762.html) and Mozilla (https://github.com/mozilla/standards-positions/issues/507) have been asked for their opinions, but without a response so far.

Discussed Oct 25, 2021 (See Github)

Propose closing at plenary

Comment by @cynthia Dec 7, 2021 (See Github)

We think the feedback @r12a wrote above is important, but it is beyond the scope of this review and should ideally be discussed in the group's repository. As noted earlier, we are happy to see this move forward. Thank you for bringing this to our attention.

(And please ping other stakeholders again when you have time!)