#393: Layout Instability Metric
Discussions
Discussed
Jul 10, 2019 (See Github)
Alice: It's new. They are wondering if they can get an expedited review. I can explain.
Alice: They are adding a new interface to the Performance API, a new performance entry type, so sites can gather performance metrics and send them back to the server to test whether changes have impacted performance. It is aimed at measuring jank; they have a jank score.
Lukasz: Measuring how far the current layout is from the layout that is supposed to be visible...?
Alice: it's more about - what does the user perceive about how stable the layout is. E.g. ads that load that cause the page to jump when you load it.
Lukasz: Sounds engine dependent and user dependent? It depends on the engine, the network, things related to the user's setup... this info will only be provided to the web site?
Alice: as opposed to what?
Lukasz: Might be misused by web sites ... like using an adversarial layout... not sure if it's possible to devise a blueprint of an adversarial display that ... may be overthinking.
Sangwhan: looking at the algo it looks like if there is a font that loads too late this will have a really large value.
Alice: excellent question ... to leave on the review.
David: I am generally worried both about whether we will be able to make it interoperable and also whether it's the thing that will be the best to optimise for or whether it will lead people to optimise for the wrong thing.
Alice: is optimising for the wrong thing still an improvement?
David: often but not always.
Peter: Question on how the performance observer API works... does it fire any time layout changes? Root question is: are we expediting this, who is doing it, and when?
David: i am happy to be involved.
Alice: Lukasz do you want to be on this?
Lukasz: sure at least about the issues I mentioned. yes. put me down.
Alice: i will assign myself.
Peter: milestone?
Peter: noting 2 things on the agenda for next week
David: let's try
[set for next week]
Comment by @lknik Jul 10, 2019 (See Github)
Hi,
We had an initial look during 10/07/19 telecon.
What we (and I) are wondering is:
- This looks like an interesting metric, and potentially something developers would choose to optimise for. Is it a good idea to focus optimisations on improving this particular score? Do any particular non-obvious concerns come to mind?
- Any general estimations on the score variations? How stable is the metric? Would it be particularly affected by dynamic elements like fonts?
- How frequently is the score updated? At any change? Would that add any further performance impact?
- Lastly: the metric would depend on the engine, the CPU, the network, etc., so it is site and possibly also user dependent. While access is to be "limited" to the particular site only, can we reasonably imagine an "adversarial" design, crafted in a way to extract something attributable to the user's computing setting across sites?
Comment by @skobes-chromium Jul 10, 2019 (See Github)
Thanks for the feedback!
We see this metric as part of a set of metrics, alongside First Input Delay (FID) and First Contentful Paint (FCP), that provide insight into user experience. Developers should not optimize for any single metric to the exclusion of others, but should optimize for all of them together.
We have found that web fonts are a common source of layout instability. I think platform improvements are needed to support web fonts with stable layout. There has been some discussion of this, e.g. on crbug.com/898083. The score can also vary with things like ad selection.
The layout shift score is generated for every rendering update that contains layout shifts. These updates are throttled to the speed of animation frames (vsync). There could still be a performance impact if the developer does something expensive in the PerformanceObserver callback.
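(For illustration, a minimal sketch of how a page would observe these entries, assuming the `layout-shift` entry type from the current explainer; exact names could still change:)

```js
// Observe layout shift scores as they are reported per rendering update.
const observer = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    // entry.value is the layout shift score for this rendering update.
    // Keep this callback cheap so the observer itself doesn't add work.
    console.log('layout shift score:', entry.value);
  }
});
observer.observe({ type: 'layout-shift', buffered: true });
```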
A website has many ways to deduce information about the user's computing capabilities for purposes such as fingerprinting. The layout instability metric is one such way, but we don't think it exposes information that couldn't be just as easily obtained through other APIs that measure the speed or query the state of layout.
Discussed
Jul 17, 2019 (See Github)
Lukasz: The ? part of the reply I'm not commenting on. There are more metrics out there. The last part says we have nothing to worry about because there are other ways to do that... do what? Can I ask what it is that might be an issue? I'm more and more skeptical about this point of view -- just because something is possible with something, doesn't mean it's ok to add other things that do the same. Or at least have it catalogued.
Dan: Sounds good to me. Often claims of no privacy concern say "the information can already be found with this other API". I'm skeptical; is that a reason to dismiss the concern that the new API can be used to get private data? I think it's worth asking that question.
Lukasz: So I'll reply.
Dan: Other thoughts?
David: I'd like to look at it more. For the interaction with layout perspective.
Comment by @lknik Jul 17, 2019 (See Github)
Hi,
Thanks for your input. Here I'm only going to refer to the last point.
> A website has many ways to deduce information about the user's computing capabilities for purposes such as fingerprinting. The layout instability metric is one such way, but we don't think it exposes information that couldn't be just as easily obtained through other APIs that measure the speed or query the state of layout.
I'm not suggesting there are any unprecedented new risks or methods of using that, so please don't get me wrong. Still, I would suggest thinking about it in these ways:
- what does it allow/enable?
- is it possible to do this now, if so with what and how?
Would be great to have it documented.
Also, last but not least, I believe the score value is a double. In your view, what would the scale [0, x] be in practice (or in theory)? Can it be bounded?
Comment by @skobes-chromium Jul 17, 2019 (See Github)
These are good questions to think about. It's already possible for a site to query an element's bounding rect, but it is infeasible to query every element at every point in time. So I would say the metric enables observing layout changes in a way that isn't efficient today.
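(To illustrate why: a rough sketch of the polling a page would have to do today, which reads every element's geometry on every frame and does not scale:)

```js
// Inefficient status quo: snapshot every element's position each frame
// and diff against the previous frame to detect shifts.
const lastRects = new Map();
function pollRects() {
  for (const el of document.querySelectorAll('*')) {
    const rect = el.getBoundingClientRect(); // may force a synchronous layout flush
    const prev = lastRects.get(el);
    if (prev && (prev.x !== rect.x || prev.y !== rect.y)) {
      // The element moved since the last frame.
    }
    lastRects.set(el, rect);
  }
  requestAnimationFrame(pollRects);
}
requestAnimationFrame(pollRects);
```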
The explainer contains some comments under "Privacy and Security" about the possible use of the API for fingerprinting, and about cross-origin subframe restrictions which the API respects.
If you have other ideas about unexpected capabilities the API might enable I would be interested in hearing about them.
I think a rough guideline on scoring might be:
- 0.0 - 0.1 = low jank
- 0.1 - 0.5 = medium jank
- 0.5 - 1.0 = high jank
Scores above 1.0 are possible but the differences become less meaningful. (The highest scores tend to be sessions with auto-repeating layout-inducing animations.)
I expect we'll refine our understanding of this over time based on the data we observe in the wild.
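(For reference, a minimal sketch of how a page might accumulate the per-update values into a single session score to compare against bands like these; this aggregation is the page's choice rather than something the spec mandates:)

```js
// Accumulate per-update layout shift values into one session-level score.
let cumulativeScore = 0;
new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    cumulativeScore += entry.value;
  }
  // e.g. a cumulativeScore of 0.05 falls in the "low jank" band above,
  // while repeated shifting can push it well past 1.0.
}).observe({ type: 'layout-shift', buffered: true });
```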
Comment by @alice Jul 19, 2019 (See Github)
We're wondering what the effective precision of this metric is, given the guidelines you outline above and this comment in the explainer:
> It is intended that the LS score have a correspondence to the perceptual severity of the instability, but not that all user agents produce exactly the same LS scores for a given page.
If that is the case, would it make sense to expose this as, say, an integer between 1 and 10 (assuming variance within those 10 buckets is essentially noise), and truncate values higher than 10 to 10?
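(Purely as a sketch of the kind of coarsening we mean, not a concrete proposal:)

```js
// Illustrative only: map a raw score to an integer bucket from 1 to 10,
// folding everything above 1.0 into the top bucket.
function bucketedScore(rawScore) {
  return Math.max(1, Math.min(10, Math.ceil(rawScore * 10)));
}
```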
Comment by @dbaron Jul 19, 2019 (See Github)
So another question here: it seems like figuring out what developers should optimize for in terms of stability during load is a question of active research. At least, I don't think there's an agreed-upon answer for what the perfect thing to measure is.
So for this specification, it feels like there's a tension between (a) specifying something that's precisely interoperable versus (b) specifying something that can be improved on over time (e.g., where browser engines could have different opinions on what type of instability is best/worst for their users, and could improve that metric over time).
I'm curious both where you think the current spec fits along that tradeoff/continuum, and what you think advantages/disadvantages of shifting to a different point along that continuum would be.
Comment by @dbaron Jul 19, 2019 (See Github)
Also, is there a plan to move this work to the Web Performance WG at some point?
Comment by @lknik Jul 22, 2019 (See Github)
Thanks for your answers @skobes!
> These are good questions to think about. It's already possible for a site to query an element's bounding rect, but it is infeasible to query every element at every point in time. So I would say the metric enables observing layout changes in a way that isn't efficient today.
So it seems to me it is in fact introducing something that is not possible today (it might be possible, maybe, with a different approach, less efficiently, etc.?) after all. Why not document this?
Any view on (ways of) manually introducing element jitter? Would that be seen by this metric?
> I think a rough guideline on scoring might be: 0.0 - 0.1 = low jank, 0.1 - 0.5 = medium jank, 0.5 - 1.0 = high jank
> Scores above 1.0 are possible but the differences become less meaningful. (The highest scores tend to be sessions with auto-repeating layout-inducing animations.)
Looks good. Would it be possible to reduce values>1 to 1, and perhaps even keep just 0, 0.2, 0.5, 0.7, 0.9, 1? Alternatively 0.{1,2,3,4,5,6,7,8,9}, 1.
> I expect we'll refine our understanding of this over time based on the data we observe in the wild.
How long would this observation phase last?
Discussed
Jul 24, 2019 (See Github)
Alice: No responses since last week. Numbers don't seem to be very precise... variation between browsers. [some discussion of comments left on issue]. How do you measure this in a way that makes sense...
Peter: Florian raised a bunch of concerns - message sent to blink-dev... July 20th.
[...reading...]
Alice: some of his comments are spec issues... [editorial] ... don't understand the fragmentation comment... he makes the point that there are spec issues that may cause trouble for interop.
Dan: can we ask them to come back to us when they revise?
Alice: we are waiting on responses to issues we raised - we could just set it to progress: pending external feedback ?
[discussion on whether to bump]
Alice: how do we recapture things once they have feedback...
Dan: meta issue - let's discuss - maybe we can use a bot
Peter: bot that looks for issues on pending feedback and looks for new comments from non tag members and flags that..
Comment by @dbaron Jul 24, 2019 (See Github)
There's a bunch of useful feedback in @frivoal's post to blink-dev.
Discussed
Aug 21, 2019 (See Github)
Alice: Breakout in which we will file issues - regular breakout time, 5pm Thursday US Pacific time
Comment by @dbaron Aug 23, 2019 (See Github)
Parts of @frivoal's post appear to be covered by WICG/layout-instability#19, WICG/layout-instability#20, and WICG/layout-instability#21.
Comment by @dbaron Aug 23, 2019 (See Github)
> So it seems to me it is in fact introducing something that is not possible today (it might be possible, maybe, with a different approach, less efficiently, etc.?) after all. Why not document this?
I think it probably is possible today by more directly observing the things that cause the layout instability: put a bunch of load events on images and fonts, record the times from them, also record the times from any key points in script execution that would modify the layout, etc. This would more directly record the performance measurements than the layout instability score, which would really be a lower-information proxy for those other things.
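(A minimal sketch of that more direct approach, assuming the page knows which of its own resources and scripts affect layout:)

```js
// Record when layout-affecting resources finish loading, rather than
// observing the resulting shifts after the fact.
const layoutEventTimes = [];
for (const img of document.querySelectorAll('img')) {
  img.addEventListener('load', () => {
    layoutEventTimes.push({ source: img.src, time: performance.now() });
  });
}
document.fonts.ready.then(() => {
  layoutEventTimes.push({ source: 'fonts', time: performance.now() });
});
// Key points in script execution that modify layout can push their own
// timestamps into the same list.
```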
Discussed
Sep 4, 2019 (See Github)
Alice: I filed a couple of issues.
David: Alice & I filed some issues. A response came in 14 hours ago.
Alice: In terms of granularity - made a good point that the cumulative score is what is meaningful. Exposing the raw score with developer guidance is what needs to happen.
David: Their responses to comments seem reasonable.
David: should this be in webperf?
Dan: can we close it?
David: I'm OK closing it. I'm just typing up a comment that it would be useful to have guidance in the spec. We can LGTM and close it.
Comment by @dbaron Sep 4, 2019 (See Github)
We've filed a few issues (see links above), and I think we're reasonably happy with the responses so far. Some of those issues do reflect things that we'd like to see documented better in the spec, but we don't think we need to keep this review open for that.
This also isn't a particularly detailed review; we hope that as this spec stabilizes it moves into a forum (most likely the Web Performance WG) where it can get appropriate detail-level review from other potential implementors.
So thanks for requesting TAG review.
Opened Jul 8, 2019
Hello TAG!
I'm requesting a TAG review of:
Further details:
https://discourse.wicg.io/t/proposal-layout-stability-metric/3187
https://github.com/WICG/layout-instability/issues