design-reviews#954: Element Capture

Opened May 10, 2024

Hej TAG!

I'm requesting a TAG review of Element Capture.

A combination of pre-existing mechanisms (getDisplayMedia, Region Capture) already allows Web applications to capture a portion of the current tab as a video MediaStreamTrack, robustly cropping away irrelevant pixels. Such videos can then be transmitted remotely; removing pixels not intended for sharing helps the sharing user's privacy, and avoids distracting the receiving users. It also helps conserve compute and network resources.

Our new API, Element Capture, takes this a step further, allowing Web applications to remove unwanted occlusions. For example, if a private message notification appears over the shared region, it is possible to avoid capturing that message, which also avoids transmitting it remotely, and therefore helps uphold privacy guarantees implicitly made to the user, who had only intended to share the target-region, and not whatever happened to be drawn over it.

Explainer: https://github.com/screen-share/element-capture/blob/main/README.md
User research: N/A
Security and Privacy self-review: https://github.com/screen-share/element-capture/blob/main/questionnaire.md
GitHub repo: https://github.com/screen-share/element-capture/
Primary contacts (and their relationship to the specification):
- Elad Alon, @eladalon1983, Google
- Mark Foltz, @markafoltz, Google
- Jordan Bayles, @baylesj, Google
Organization/project driving the design: Google Chrome and Google Meet
External status/issue trackers: https://chromestatus.com/feature/5198989277790208

Further details:

I have reviewed the TAG's Web Platform Design Principles
The group where the incubation/design work on this is being done (or is intended to be done in the future): Screen Capture CG
The group where standardization of this work is intended to be done ("unknown" if not known): Screen Capture CG and/or WebRTC WG
Existing major pieces of multi-stakeholder review or discussion of this design: TPAC 2023 SCCG / WebRTC WG joint meeting
Major unresolved issues with or opposition to this design: There are disagreements between Google and Mozilla on privacy concerns.
This work is being funded by: Google

You should also know that...

Strong positive Web developer feedback for this feature was expressed on https://github.com/screen-share/element-capture/issues/3 and during Screen Capture CG meetings.

Discussions

Discussed Aug 26, 2024 (See Github)

This sits on a list of features that Google has deployed without standardization (this is still in Origin Trial, I believe).

Problems here are the same as those inherent to screen sharing. Some fingerprinting risk, but the primary risk is that this enables capture of content that a site might not otherwise be authorized to obtain.

Dan: what's the permissions story? do they document abuse cases / mitigations?

Martin: The explainer ... they are relying on screen capture permissions model etc... they are just looking to double down on that.

Dan: so we are not asking the user to share...

Martin: there was a debate on what influence the application should have over the shape and scope of the permissions window... targeting screen sharing. One potential outcome is that the web site can share itself... there is a spec for that.

Martin: 625 was an API for capturing the current tab... the key problem with screen capture - allows the web site to see what is on the page. It can frame in content from another site and screen capture the current page including that content... this is why we have permission prompts that put user interests ahead ... element capture gives the ability of not just capturing what's on screen but what is off-screen, including cross-origin...

Dan: in what cases cross-origin?

Martin: if the element you identify is or contains an iframe for example...

Dan: what if the spec said "no cross-origin"? would that mitigate?

Martin: that was one of the mitigations discussed... we also have viewport media function...

Martin: getViewportMedia has a site isolation requirement... which uses the same COEP for shared array buffer access... so from that perspective it's probably OK. Element capture in that case would be fine... getDisplayMedia gets you some piece of screen real estate - getViewportMedia gets the current tab...

Dan: is that feedback we could provide?

Martin: it looks like they required the target to opt in... that changes things a little bit. The cross-tab... you select a different tab and within that tab you want to focus on a specific chunk of that tab. and it looks like the target tab chooses what is captured...

Dan: I have a presentation running in one tab and I want to share it - but only the presentation, not the controls, to another tab that is webrtc session...

Martin: that looks like a good use case... this does look like it's cooperative...

Dan: Seems brittle. Needs coordination. Is it specific to the capturing or captured application or specific origins. Seems to lack detail.

Martin: Might ask for more detail.

Dan: Sympathetic to the use case. Either present the whole tab or the whole screen, but then you can't see people. Moving away from the tab can destroy the stream. Need to understand how this changes the permission flow. If you are locked in to a specific target... Are you allowed to override what a site says you can share? Could people be railroaded into a specific selection?

https://w3c.github.io/mediacapture-viewport/

Martin: let's say that you've identified that element and you can render that element to a stream - and not the elements "on top" of it - what happens with transparency?

<blockquote> Hi - some pieces of feedback from our TAG breakout this morning where we reviewed this:

It seems like the explainer is very lean. We think that there are a number of issues that need to be more fully explored before we can be more sure about this proposal.

In the use case where you're sharing a specific content area to an embedded iframe (the use case in the explainer), what is the permissions flow? For example - in current screen sharing scenarios, the user may be prompted to share a tab, a window, or the whole screen. What would the user be prompted for in this case? Would they be able to choose an alternative sharing target such as another tab or the screen, or is it envisioned that in this case they would be constrained to only share content from the designated application?

Can this be treated like an extension to ViewPortCapture? We note that this sort of sharing carries similar security risks as that API, and the additional constraints on capture in that API might be better suited to this use case than the more general getDisplayMedia.

The proposed API starts by preparing to share the whole of the content, and then restricting it to a particular part - have you considered ways to start with the specific part to be shared instead? (How would this affect occlusion?)

You have a goal of avoiding occlusion, but what about elements that are partially-transparent? Would this capture what is rendered behind an element?

</blockquote>

Comment by @torgo Aug 28, 2024 (See Github)

Hi - some pieces of feedback from our TAG breakout this morning where we reviewed this:

It seems like the explainer is very lean. We think that there are a number of issues that need to be more fully explored before we can be more sure about this proposal.

Can this be treated like an extension to ViewPortCapture? We note that this sort of sharing carries similar security risks as that API, and the additional constraints on capture in that API might be better suited to this use case than the more general getDisplayMedia.

The proposed API starts by preparing to share the whole of the content, and then restricting it to a particular part - have you considered ways to start with the specific part to be shared instead? (How would this affect occlusion?)

You have a goal of avoiding occlusion, but what about elements that are partially-transparent? Would this capture what is rendered behind an element?

Discussed Sep 2, 2024 (See Github)

Tweaked the labels, but this is waiting on authors/proponents.

Discussed Sep 9, 2024 (See Github)

Matthew: thoughts... first of all they responded positively - that's good. They posted a blog and some of this should have been in the explainer... would have answered some of our questions. One thing clarified is that they are doing occlusion - you will only get a video stream of that element (not occlusions). They also address the issue whether you should be able to share a particular element... didn't address something that might be off-screen. They do address the transparency... Also with respect to the API shape, we had a question - starting with the tab and then drilling down - the reason they started that way - (1) it leans on existing infra for permissions and (2) that it sets expectations about what could be shared. I'm torn on it because I see what they're saying but it feels like extra work.

Dan: notes lack of multi-implementer support.

Matthew: one other thing - they said they did consider the alternative we suggested but there are no alternatives suggested in the explainer...

Martin: this is better than the original version but didn't address the concerns that Mozilla had... And I would like to have that discussion (about viewportcapture). I think the right thing would be to ask for the explainer to be updated... then we can move on to the next step.

Matthew to ask them to update the explainer and ask about off-screen elements

Matthew's posted comment

Comment by @eladalon1983 Sep 10, 2024 (See Github)

(Note: Questions reordered to make the answers clearer, as later answers build on top of earlier ones.)

It seems like the explainer is very lean.

I aimed to make the explainer brief, and this article goes into more details and is more "instructive" in its tone. HTH?

You have a goal of avoiding occlusion, but what about elements that are partially-transparent?

Occluded content is "magic erased" from the capture, as is occluding content. The article (link above) discusses this in detail, while the explainer, I acknowledge, only made passing and implicit reference to this fact ("frames produced on the restricted video track only consist of information from the target-element and its descendants"). Hope that's clear now. :-)
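
The difference between cropping and restricting can be illustrated with a toy compositing model. This is only a conceptual sketch in TypeScript: the `Layer` type, its two flags, and both functions are hypothetical, and they say nothing about how a browser actually implements capture.

```typescript
// Hypothetical model: a page is a back-to-front list of painted layers.
interface Layer {
  name: string;
  // true for the target element and its descendants
  inTargetSubtree: boolean;
  // true if the layer paints inside the target's bounding box
  overlapsTargetRect: boolean;
}

// Region Capture style: crop to a rectangle and keep whatever is painted there.
function croppedLayers(page: Layer[]): string[] {
  return page.filter((l) => l.overlapsTargetRect).map((l) => l.name);
}

// Element Capture style: keep only the target element and its descendants,
// dropping both what is behind it (occluded) and what is on top of it (occluding).
function restrictedLayers(page: Layer[]): string[] {
  return page.filter((l) => l.inTargetSubtree).map((l) => l.name);
}

const page: Layer[] = [
  { name: "page background", inTargetSubtree: false, overlapsTargetRect: true },
  { name: "slides", inTargetSubtree: true, overlapsTargetRect: true },
  { name: "slide video", inTargetSubtree: true, overlapsTargetRect: true },
  { name: "chat notification", inTargetSubtree: false, overlapsTargetRect: true },
  { name: "sidebar", inTargetSubtree: false, overlapsTargetRect: false },
];

// Cropping still includes the background and the notification drawn over the slides.
console.log(croppedLayers(page));
// Restricting keeps only the target subtree.
console.log(restrictedLayers(page));
```

In this model a partially transparent target shows the page background through it when cropped, but not when restricted, which matches the quoted statement that frames only consist of information from the target element and its descendants.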

what is the permissions flow for this scenario? [...] What would the user be prompted for in this case?

This API builds on top of the existing screen-sharing APIs, meaning that the permission flow remains entirely unchanged. An application would first call getDisplayMedia(...) or getViewportMedia() or any other past/future screen-sharing API, and the user would go through the usual selection process associated with it. It's only after this completes, if the user shares the (entire) current tab, that the Element Capture API can be invoked.
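
A minimal TypeScript sketch of the two-step flow described in this answer; it is not the explainer's own code. `RestrictionTarget.fromElement()` and `restrictTo()` are the names used by the Element Capture proposal. The `CaptureDeps` wrapper, its field names, and `captureElementOnly` are hypothetical, introduced so that the sketch type-checks and runs outside a browser that ships the API.

```typescript
// Hypothetical structural type: only the member this sketch relies on.
interface RestrictableTrack {
  restrictTo(target: unknown): Promise<void>;
}

// Hypothetical dependency bundle so the flow can be exercised with fakes.
interface CaptureDeps<E, T extends RestrictableTrack> {
  // In a browser this would wrap an existing screen-sharing call, for example
  // navigator.mediaDevices.getDisplayMedia({ video: true, preferCurrentTab: true }),
  // and return the resulting video track.
  requestTabTrack: () => Promise<T>;
  // In a browser this would wrap RestrictionTarget.fromElement(element).
  makeTarget: (element: E) => Promise<unknown>;
}

// Step 1: the unchanged permission flow produces a track of the current tab.
// Step 2: only then is the track restricted to one element's subtree.
async function captureElementOnly<E, T extends RestrictableTrack>(
  deps: CaptureDeps<E, T>,
  element: E,
): Promise<T> {
  const track = await deps.requestTabTrack();
  const target = await deps.makeTarget(element);
  await track.restrictTo(target);
  return track;
}
```

Because the user prompt belongs entirely to the first step, the second step adds no new permission surface; per the answer above, it only applies when the user chose to share the current tab.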

Can this be treated like an extension to ViewPortCapture?

That's an alternative approach that we have considered. But as of the time of writing, getViewportMedia() remains theoretical, several years after it was initially proposed. To ensure impact, we have shaped Element Capture to be agnostic of whether gDM, gVM or any other API produced the track which is being "restricted" by our new API.

[...] have you considered ways to start with the specific part to be shared instead?

I actually think that starting with the entire current tab, is a strength of the current API shape, because we lean on established methods to prompt the user to share something they know is compromising, and avoid giving them the false sense of security, that they are sharing "less". Imagine a user, for instance, sharing "just the X iframe" and not realizing that it could, at any moment, be navigated, or load cross-origin resources... But sharing the entire current tab, that's a concept users already understand, and they know that it requires elevated trust.

Comment by @matatk Sep 11, 2024 (See Github)

Thanks for your detailed reply @eladalon1983. The article you linked to answers several questions; thanks for that too. It would be really helpful for review, and future reference, if you could move that content from the article into the explainer; it's OK to give a bit of the 'how to' info, as long as the explainer starts with the user needs being solved. That info, and the code snippets, help to convey the intended API shape.

There are a couple of additional things that we'd really like to see in the explainer:

Please could you include some info on whether elements that are outside of the viewport (i.e. elements that are not visible) would be capture-able? (Your info on display: none is noted.)
We asked about Viewport Capture, and in your reply, you address this as one alternative that was considered. Please could you include an 'Alternatives considered' section in the explainer, and mention the rationale re Viewport Capture - and any other approaches that you considered?

Thanks in advance; we are looking forward to learning more about the above.

Discussed Oct 7, 2024 (See Github)

Matthew: we had a load of questions and it turned out that most of the questions were addressed in the article. We asked them to put it in the explainer. A couple of the q's we had were not answered... we asked them. No feedback. I think most of our concerns were addressed. It feels a little weird - but they are doing compositing properly - removing transparency... but it would be good to have this alternatives considered section... ball's in their court.

we ping the issue to get feedback

Comment by @matatk Oct 9, 2024 (See Github)

Hi @eladalon1983, we discussed this review in our call today - just checking if you have any feedback with respect to our comment/questions above?

Discussed Oct 14, 2024 (See Github)

Still waiting for a reply from proponents.

Comment by @eladalon1983 Oct 16, 2024 (See Github)

Apologies for taking some time here. I'll respond soon.

Comment by @eladalon1983 Oct 25, 2024 (See Github)

I have updated the explainer; PTAL.

Discussed Nov 11, 2024 (See Github)

Matthew: some further info... They updated the explainer with some of the info we requested - explained occlusion and transparency, code samples, expect Lea might have some thoughts on the API shape. They didn't answer the question about "alternatives considered what about viewport capture"... we might want an answer to that... Explainer diff

Jeffrey: seems they have answered that in the main thread ... We can mention again in our closing comment but it shouldn't block us...

Dan: Any changes to the spec itself?

Matthew: No but much better explained now. We could bump to the plenary with proposed closed.

Peter: I don't see any issues with the API shape...

Lea: on cursory examination seems fine...

we agree to close by the plenary - Lea will raise any issues if she finds them

Comment by @martinthomson Nov 14, 2024 (See Github)

Regarding the name RestrictionTarget, please consider using a name that makes it clearer that this is related to (screen) capture. Those words are very generic. If that name were to appear outside of a capture context, it would not be clear what it means.

Discussed Nov 18, 2024 (See Github)

Matthew: I closed and documented...

closed

Discussed Nov 18, 2024 (See Github)

Matthew: Martin had a question to ask... We didn't know what resolution to use. Martin says "satisfied" is OK. Jeffrey said "Satisfied with concerns" seems ok.

Jeffrey: happy with either. Closing is the most important.

Matthew to post comment and close.

Dan: satisfied with concerns might be more appropriate to nudge them

Peter: sounds good

closed

Comment by @matatk Nov 19, 2024 (See Github)

Thank you for the explainer updates, @eladalon1983. We are happy with this proposal overall, but please could you consider @martinthomson's concern above regarding the naming of RestrictionTarget?

We don't have anything further to add, so we'll close this. The "satisfied with concerns" status is to reflect that naming concern.