#568: WebXR Hand Input API Specification
Discussions
Discussed
Nov 1, 2020 (See Github)
[some discussion on this and especially questions around what happens when the user's hands differ - e.g. if they are missing a part of their finger...]
Alice: what are the extra joints for... how does it work for missing joints, missing fingers?
Yves: each hand is one input source...
Alice: medical names are kind of a no-no from our naming principles....
Comment by @alice Nov 24, 2020 (See Github)
One quick question: can you explain why the API is named around medical terms for bones, rather than something more straightforward?
Comment by @asankah Nov 24, 2020 (See Github)
The security/privacy self review states:
Data returned from this API, MUST NOT be so specific that one can detect individual users. If the underlying hardware returns data that is too precise, the User Agent MUST anonymize this data (ie by adding noise or rounding) before revealing it through the WebXR Hand Input API.
Could you elaborate a bit more on how an implementation should evaluate a noising or rounding strategy? I.e. how should an implementation evaluate anonymity?
Would there be recommendations around minimum fidelity for sensor readings?
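For concreteness, the rounding the spec text mentions could be as simple as snapping joint positions to a coarse grid before exposing them. The sketch below is purely illustrative - the grid size is an arbitrary example, not a recommendation, and choosing parameters that actually provide anonymity is exactly the open question being asked here:
// Illustrative sketch only, not from the spec: a user agent could quantize
// joint positions so that exact finger lengths are not observable.
// The 5 mm grid is an arbitrary example value.
function quantizePosition(position, gridMetres = 0.005) {
  return {
    x: Math.round(position.x / gridMetres) * gridMetres,
    y: Math.round(position.y / gridMetres) * gridMetres,
    z: Math.round(position.z / gridMetres) * gridMetres,
  };
}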
Comment by @Manishearth Nov 24, 2020 (See Github)
@alice
One quick question: can you explain why the API is named around medical terms for bones, rather than something more straightforward?
This is what every other XR platform does for hand input, and we wanted to be consistent with expectations. There aren't any good names to use otherwise; at best you could number things, but that still gets confusing: there is a joint before each knuckle that needs to be included (the thumb's is the most important one), and intuitively the finger ends at the knuckle.
Comment by @Manishearth Nov 24, 2020 (See Github)
@asankah
Could you elaborate a bit more on how an implementation should evaluate a noising or rounding strategy? I.e. how should an implementation evaluate anonymity?
At the moment, we don't have a clear idea of this: @fordacious / @thetuvix / @cabanier might though. This is one of the bits of privacy work I'd like to see as we move forward (since I consider the API surface mostly "done").
It also might be worth downgrading this to a SHOULD, since a valid choice for an implementation to make is to expose precise data but be clear about fingerprinting risks in the initial permissions prompt.
Comment by @cabanier Nov 24, 2020 (See Github)
The Oculus browser implementation exposes a hand model that is the same for everyone. The underlying implementation has additional information to make the model better match the user's hands but we decided not to apply that to preserve privacy.
Discussed
Dec 1, 2020 (See Github)
Sangwhan: it's very medical.
Alice: that is kind of the main point of feedback - asking people to memorize the bone names...
Hadley: but if we need a vocab, that's a good one to use...
Sangwhan: could use index numbers...
Alice: I suggested that... referencing the Unity API... easier to understand... The response made some sense. Different platforms expose different amounts of information - if you're using a numbered system you still want to be able to see which parts of the hand are included in that number system... It's a subset of possible bones.
Sangwhan: if you extend it you have this weird situation where you have 1, 2, 3, 4... and when you add something in it could become 25.
Alice: the last comment said it's changing the constants to enums... which avoids this problem.
Alice: nobody has come up with a plain English set of names for every bone in the hand...
Hadley: there are lots of resources out there [to help developers know these names]
Hadley: I can imagine a future need to name other bones in the body and we've already used this special thing for hands...
Alice: it's about detecting a gesture... based on image recognition... If there were some other vocab to use we should use that but there isn't.
... I gave some feedback to ask if they could consider a richer data structure...
Dan: performance issue... something something uncanny valley
Sangwhan: isn't this API just exposing a network of vertices in a semantic OO form...
Alice: I would like to see the updated examples... They talked about it being iterable... Don't understand how that helps...
Comment by @alice Dec 1, 2020 (See Github)
Regarding naming: I see that the Unity hand tracking API, for example, doesn't use the medical names for bones. They use a number to indicate the joint number.
The TAG design principles doc notes:
API naming must be done in easily readable US English. Keep in mind that most web developers aren’t native English speakers. Whenever possible, names should be chosen that use common vocabulary a majority of English speakers are likely to understand when first encountering the name.
and also:
You will probably not be able to directly translate an API available to native applications to be a web API.
Instead, consider the functionality available from the native API, and the user needs it addresses, and design an API which meets those user needs, even if the implementation depends on the existing native API.
I don't think using the bone name reduces the ambiguity either, since you're referring to a joint rather than the bone in any case.
Each of your examples sets up a structure like:
[ [XRHand.INDEX_PHALANX_TIP, XRHand.INDEX_METACARPAL],
[XRHand.MIDDLE_PHALANX_TIP, XRHand.MIDDLE_METACARPAL],
[XRHand.RING_PHALANX_TIP, XRHand.RING_METACARPAL],
[XRHand.LITTLE_PHALANX_TIP, XRHand.LITTLE_METACARPAL] ]
... would it work to have the API expose the hand data via a richer data structure?
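(Purely to illustrate the question - this shape is hypothetical and not something the explainer proposes - a "richer data structure" might group joints per finger rather than exposing a flat list of constants:)
// Hypothetical shape, sketched only to illustrate the question above;
// nothing like this exists in the explainer or spec.
const hypotheticalHand = {
  wrist: null,  // would hold an XRJointSpace
  fingers: {
    index: { metacarpal: null, proximal: null, intermediate: null, distal: null, tip: null },
    // ...and similarly for thumb, middle, ring and little fingers
  },
};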
Regarding privacy: the strategy @cabanier mentions seems like a good start.
I'm also trying to understand the relationship between hand as a member of XRInputSource, and the primary action concept. Does hand input provide a way of generating a primary action?
Also, can you give some background on how hand tracking works for people who are missing or unable to use one or more fingers on the hand(s) being used for hand tracking - how does this affect the data which is provided to the application?
Finally, maybe a silly question - if an application wanted to track both hands, would that be two separate InputSources?
Comment by @domenic Dec 1, 2020 (See Github)
An issue that may be of interest to the TAG, as it concerns design guidelines for modern APIs: https://github.com/immersive-web/webxr-hand-input/issues/70
Comment by @cabanier Dec 1, 2020 (See Github)
Regarding naming: I see that the Unity hand tracking API, for example, doesn't use the medical names for bones. They use a number to indicate the joint number.
I agree that the current names are not easily understood, even by native English speakers. Should we rename the joints with simpler names, much like the unity example you listed?
Each of your examples sets up a structure like: ... ... would it work to have the API expose the hand data via a richer data structure?
Actual code wouldn't use those structures. I think @Manishearth provided that to clarify how the mapping is done.
I'm also trying to understand the relationship between hand as a member of XRInputSource, and the primary action concept. Does hand input provide a way of generating a primary action?
The hands API is not involved in this. Each hand will also be a "controller" which has actions associated with it.
Also, can you give some background on how hand tracking works for people who are missing or unable to use one or more fingers on the hand(s) being used for hand tracking - how does this affect the data which is provided to the application?
There is an issue on this. The spec currently defines that the hand will always return all the joints.
Finally, maybe a silly question - if an application wanted to track both hands, would that be two separate InputSources?
Yes :-)
Comment by @alice Dec 1, 2020 (See Github)
Should we rename the joints with simpler names, much like the unity example you listed?
I think that would be helpful to at least consider - naming the joints by number also makes it easier to understand the ordering without memorising or looking up the names of the bones each time (and also remembering that the joint comes before the named bone).
Actual code wouldn't use those structures. I think @Manishearth provided that to clarify how the mapping is done.
I see. Could we see some more realistic code examples somewhere?
The hands API is not involved in this. Each hand will also be a "controller" which has actions associated with it.
Can you expand on this? How would someone using hand input access the default action?
There is an issue on this. The spec currently defines that the hand will always return all the joints.
Great issue, thank you!
Comment by @Manishearth Dec 2, 2020 (See Github)
Regarding naming: I see that the Unity hand tracking API, for example, doesn't use the medical names for bones. They use a number to indicate the joint number.
A problem with this is that it's not extensible: we're not exposing all of the hand joints that exist, only the joints that are typically used in VR hand tracking.
I find numbering to be more confusing because different platforms may choose to index differently: e.g. the indexing changes based on which carpals and metacarpals you include. For example, on Oculus/Unity only the thumb and pinky fingers have metacarpals, and the thumb also has a trapezium carpal bone. On the other hand (hah), OpenXR provides a metacarpal bone for all fingers, but no trapezium bone. So numbers don't really carry a cross-platform useful meaning.
If you just want to iterate over all of the joints, you can do that without knowing the names, but if you're going to be thinking about detecting gestures, I find names and a diagram to be far easier than plain numbers. Most humans have more than these 25 bones (+ tip "bones") in each hand, "index joint 0" doesn't tell me anything unless you show me a diagram.
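A rough sketch of that difference, using the XRHand.* constant names from the examples quoted earlier (plus the analogous thumb-tip constant) and assuming frame, referenceSpace and inputSource come from an ordinary XR frame callback; the 2 cm pinch threshold is an arbitrary illustrative value:
// Iterating over every joint needs no names at all, e.g. to render a skeleton:
for (const jointSpace of inputSource.hand) {
  const pose = frame.getJointPose(jointSpace, referenceSpace);
  if (pose) drawJoint(pose.transform, pose.radius);  // drawJoint: app-defined
}

// Detecting a gesture, by contrast, means naming specific joints,
// e.g. a crude pinch check between the thumb tip and the index tip:
const dist = (a, b) => Math.hypot(a.x - b.x, a.y - b.y, a.z - b.z);
const thumbPose = frame.getJointPose(inputSource.hand[XRHand.THUMB_PHALANX_TIP], referenceSpace);
const indexPose = frame.getJointPose(inputSource.hand[XRHand.INDEX_PHALANX_TIP], referenceSpace);
if (thumbPose && indexPose &&
    dist(thumbPose.transform.position, indexPose.transform.position) < 0.02) {
  // treat as a pinch
}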
I don't think using the bone name reduces the ambiguity either, since you're referring to a joint rather than the bone in any case.
it's both, really, since the orientation of that space is aligned with the named bone.
I'm also trying to understand the relationship between hand as a member of XRInputSource, and the primary action concept. Does hand input provide a way of generating a primary action?
Yes, but that's not under the purview of this spec at all. Oculus Browser and Hololens use pinch/grab actions for the primary action. The precise selection for the primary action gesture is up to the platform defaults.
Actual code wouldn't use those structures. I think @Manishearth provided that to clarify how the mapping is done.
The first example is doing this because it is outdated: it is iterable now so you don't need that array. The second example does need this.
I considered a structured approach in the past but there are basically many different ways to slice this data based on the gesture you need, so it made more sense to surface it as an indexable iterator and let people slice it themselves. Also, starting with a structured approach now may lock us out of being able to handle hands with more or less than five fingers in the future.
I can update the explainer to use the iterator where possible!
Also, can you give some background on how hand tracking works for people who are missing or unable to use one or more fingers on the hand(s) being used for hand tracking - how does this affect the data which is provided to the application?
As Rik said, https://github.com/immersive-web/webxr-hand-input/issues/11 covers this. At the moment this is entirely based on platform defaults: some platforms may emulate a finger, others may not detect it as a hand (unfortunate, but not something we can control here).
Currently all of the hand tracking platforms out there are all-or-nothing, AIUI, which means that they will always report all joints, and if some joints don't exist they'll either emulate them or refuse to surface a hand.
I want to make progress here, but I fear that doing so without having platforms that support it is putting the cart before the horse. A likely solution would be where you can use an XR feature descriptor to opt in to joints being missing as an indicator of "I can handle whatever configuration you throw at me". Polydactyl hands will also need a similar approach.
Comment by @Manishearth Dec 2, 2020 (See Github)
Can you expand on this? How would someone using hand input access the default action?
It depends on the platform, it's typically some kind of pinching gesture. It's whatever people use for "select" when using hands on the rest of the platform, outside of the web. This is how the WebXR API treats the primary action for physical controllers as well: it's whatever button people will be using on the rest of the platform (usually a trigger).
Each XRHand is owned by an XRInputSource, which represents an input source, and the actions are tied to that, as defined in the core spec. The XRHand surfaces additional articulated joint information about physical hand input sources, but it was already spec-compliant for an XR device to use hands as input without needing to opt in to the XR Hand Input specification.
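In practice that means a page listens for the usual WebXR select events and needs nothing hand-specific to react to the primary action; roughly:
// The primary action (a pinch on current hand-tracking platforms) arrives
// through the standard WebXR select events; the hand attribute only tells
// you whether the source that fired it is an articulated hand.
xrSession.addEventListener('select', (event) => {
  if (event.inputSource.hand) {
    // selected via a hand gesture
  } else {
    // selected via a controller button, etc.
  }
});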
Comment by @Manishearth Dec 2, 2020 (See Github)
I can update the explainer to use the iterator where possible!
Oh, actually, both examples need it to be explicit. A structured API around this might be useful, and I'm open to adding one, but I'm wary of locking out accessibility in the future as platforms start exposing more info about hands with more or less than 5 fingers.
Comment by @Manishearth Dec 2, 2020 (See Github)
Note: We're probably changing the constants to enums.
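For the earlier examples that would mean replacing the integer constants with string-valued joint names looked up through a map-like interface. The sketch below is illustrative only, since the exact enum values hadn't been settled at this point:
// Illustrative only: enum-style string keys in place of the XRHand.* constants.
const indexTip = inputSource.hand.get('index-finger-tip');
const indexTipPose = frame.getJointPose(indexTip, referenceSpace);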
Comment by @alice Dec 8, 2020 (See Github)
Thanks for making the change to enums!
Thanks also for the more in-depth explanation of why the anatomical terms make the most sense - the extensibility argument in particular is very reasonable.
Regarding a structured API - could you expand on the implication for accessibility?
Comment by @cabanier Dec 8, 2020 (See Github)
Regarding a structured API - could you expand on the implication for accessibility?
Can you elaborate what you mean by this?
Comment by @Manishearth Dec 8, 2020 (See Github)
Regarding a structured API - could you expand on the implication for accessibility?
As mentioned earlier I'm wary of designing anything that can handle users with uncommon hand configurations (e.g. polydactyl users) until we have accessible device APIs that this can be built and experimented on. It's reasonably easy to design things without closing the door to future improvements for the unstructured API, but the more structure we introduce, the more assumptions about the hand we introduce. Ideally, such a structured API would handle changes in hand structure. I would rather not close these doors, which is why I'd like to start with the unstructured API.
I'm not fully against adding a structured API -- I think it would be pretty nice to have -- but I'm mostly comfortable letting frameworks handle this right now.
Comment by @alice Dec 9, 2020 (See Github)
I guess I'm still not quite getting how a set of enums is more flexible than a fully structured API, since the naming of the enums already implies a certain hand structure.
Comment by @Manishearth Dec 9, 2020 (See Github)
I guess I'm still not quite getting how a set of enums is more flexible than a fully structured API, since the naming of the enums already implies a certain hand structure.
I think it's more that the enums are not necessarily super flexible, but they're also not the right approach for uncommon hand structures, which will likely need a level 2 and a structured API, but I don't want to design the structured API until we better understand how uncommon hand structures will work at the device level. The alternative is designing a structured API now, but having to design a second one when we get more devices that can handle uncommon hand structures and having a better understanding of how this API should work.
Comment by @alice Dec 9, 2020 (See Github)
Thanks for the explanation.
This does raise some questions about where the responsibility lies for designing a more inclusive API - if manufacturers are not being inclusive, do we just wait for them to get around to it? Do we spend some effort imagining what a more inclusive system might look like in the meantime?
I don't have answers for these questions, personally, but I think they're worth thinking about (obviously they don't just apply to this API, but it's a good example to consider).
Comment by @Manishearth Dec 9, 2020 (See Github)
Do we spend some effort imagining what a more inclusive system might look like in the meantime?
I have been spending some effort on this, and I plan to do more of this as well! I have ideas on how this could work well. I'm just wary of including this in the spec given that it actually working well requires a decent amount of buy in from device manufacturers, and I don't perceive the existence of the API to be sufficient pressure to do this.
I'm hoping to spend some of the WGs time on this issue (after all, many device manufacturers are part of the WG!) after having more conversations with potentially affected users, but I don't have the time to start that just yet.
Discussed
Jan 1, 2021 (See Github)
Alice: I think we are actually done with this one. I should respond to the last comment, the one saying they have been spending effort imagining a more inclusive system. It's very hardcoded that everyone has got two hands and five fingers... well, any number of hands. Any given hand has five fingers, any given finger has the expected joints. Which obviously doesn't bear up to reality. We talked about this being the baseline API until someone ships an API that takes that into account, but they've got to do the classic thing - it's hardware, so it's the superset of what all the underlying things do. I was posing the question of whose responsibility it is to come up with the more inclusive version, and the answers are quite good: they have ideas, have been working with some of the WGs including device manufacturers, but they don't have time yet. That seems pretty good. The one outstanding issue, which another commenter raised right up in the second comment about privacy, is the fact that it returns real metrics about people's finger lengths etc., and this is the one where the explainer starts talking about hand phrenology. It's really wild. It might be worth reiterating that point. They ask for elaboration on anonymity - we don't have a clear idea of this.
Dan: it says implementations are required to employ mitigation strategies - reducing the precision of sampling data, adding noise...
Alice: I guess that's fine. I suggested it could be a richer data structure rather than a bunch of enums, and they said no, not right now. They did change it to enums instead of constants, but other than that it's unchanged. It looks pretty good. I'm happy to comment and propose closing.
Dan: I think we should do that, we've gone around and around and they've been responsive to our feedback. I support that.
Comment by @fordacious Jan 4, 2021 (See Github)
When is a TAG review officially completed? What are the next steps?
Comment by @cabanier Jan 5, 2021 (See Github)
When is a TAG review officially completed? What are the next steps?
@domenic the TAG filed 2 issues. Was that the extent of the review or do we have to wait for an official blessing?
Comment by @alice Jan 12, 2021 (See Github)
We just discussed this in our breakout meeting.
Thank you so much for your patience and responsiveness through this process! We're happy with how this is progressing, so I'm proposing to close this (and it will likely be closed at our plenary tomorrow).
Opened Nov 12, 2020
HIQaH! QaH! TAG!
I'm requesting a TAG review of WebXR Hand Input.
The WebXR Hand Input module expands the WebXR Device API with the functionality to track articulated hand poses.
Further details:
We'd prefer the TAG provide feedback as:
☂️ open a single issue in our GitHub repo for the entire review