#392: Scroll To Text
Discussions
Discussed
Jul 10, 2019 (See Github)
Peter: haven't there been multiple things like this:
David: an IETF thing
Tess: and an epub thing (epub-cfi). bunch of prior art.
Peter: also annotations...
Hadley: do we have past TAG reviews we could link to?
Tess: at least in the case of the similar epub technology - it has a syntax for doing assertions... and an algorithm for figuring out what the likely correct thing is. some robustness... I guess assign me.
David: I guess I could be. I've wanted this for a long time.
Comment by @dbaron Jul 10, 2019 (See Github)
Is it worth having any similarity to RFC 5147's mechanism?
Comment by @torgo Jul 10, 2019 (See Github)
Hi - just noting that the assessment of the questionnaire appears blank.
Comment by @nickburris Jul 10, 2019 (See Github)
Is it worth having any similarity to RFC 5147's mechanism?
The problem with using character positions is that the references would be outdated much more quickly, for example if this were to be used for cross-references on Wikipedia, where articles are updated often. Naturally, text references will also go stale over time, but only when the actual target text is modified.
Hi - just noting that the assessment of the questionnaire appears blank.
Sorry - I misunderstood the template. Here is the questionnaire: https://github.com/bokand/ScrollToTextFragment/blob/master/security-privacy-questionnaire.md
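To illustrate the staleness argument above, here is a minimal sketch that compares a character-position reference (in the spirit of RFC 5147) with a text-based reference after an unrelated edit earlier in the page. The sample strings and the helper names resolveByPosition and resolveByText are hypothetical and are not part of either proposal.

```javascript
// Hypothetical page text before and after an unrelated edit near the top.
const before = "Lice are small insects. Wet combing is one option.";
const after = "Head lice are small insects. Wet combing is one option.";

const quote = "Wet combing";

// A position-based reference stores character offsets into the text.
const start = before.indexOf(quote);
const positionRef = { start, end: start + quote.length };

// A text-based reference stores the quoted text itself.
const textRef = { text: quote };

// Resolve a position reference by slicing at the stored offsets.
function resolveByPosition(text, ref) {
  return text.slice(ref.start, ref.end);
}

// Resolve a text reference by searching for the stored text.
function resolveByText(text, ref) {
  const index = text.indexOf(ref.text);
  return index === -1 ? null : text.slice(index, index + ref.text.length);
}
```

Against the edited text, the position reference no longer lands on the quoted words, while the text reference still does. The text reference only goes stale if the quoted words themselves change.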
Comment by @aarongustafson Jul 11, 2019 (See Github)
Some additional thoughts for consideration based on research & exploration we’ve been doing: https://github.com/MicrosoftEdge/MSEdgeExplainers/blob/master/Fragments/explainer.md
Comment by @annevk Jul 12, 2019 (See Github)
I couldn't find a description of how "Restricted to pages without an opener (no window.open)" is managed. (In particular, if A1 opens a popup A2 which then navigates A1 to V, V won't have an opener, but we certainly don't want this to work there.)
Comment by @nickburris Jul 15, 2019 (See Github)
I couldn't find a description of how "Restricted to pages without an opener (no window.open)" is managed. (In particular, if A1 opens a popup A2 which then navigates A1 to V, V won't have an opener, but we certainly don't want this to work there.)
The case you describe would indeed circumvent this. I'm not sure how we can mitigate this case - does anyone know of any precedent for this kind of restriction in spec already?
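The circumvention described above can be modelled with a small sketch. It assumes the restriction is a simple predicate over a browsing context (top-level and no opener). The object shape, the allowsTextFragment helper, and the example URLs are hypothetical and are not taken from the spec.

```javascript
// Hypothetical model of the restriction: only top-level contexts without an opener qualify.
function allowsTextFragment(context) {
  return context.isTopLevel && context.opener === null;
}

// Attacker page A1 is an ordinary top-level tab with no opener.
const a1 = { name: "A1", isTopLevel: true, opener: null, url: "https://attacker.example/" };

// A1 opens popup A2, so A2 has A1 as its opener.
const a2 = { name: "A2", isTopLevel: true, opener: a1, url: "https://attacker.example/popup" };

// A2 navigates its opener A1 to the victim page V with a text fragment.
a1.url = "https://victim.example/#targetText=secret";

// A1 still has no opener, so the predicate passes even though A2 controls the navigation.
const victimAllowed = allowsTextFragment(a1);
const popupAllowed = allowsTextFragment(a2);
```

In this model the check only blocks the popup itself. The tab that the popup navigates passes the check, which is the gap raised in the comment.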
Comment by @nickburris Jul 23, 2019 (See Github)
FYI, we're currently exploring a way to improve web compatibility by placing the targetText behind a delimiter - e.g. a ## in the URL fragment separating the targetText which would then be stripped off and hidden from the page. This mitigates issues with sites that use the fragment for other purposes, e.g. https://www.webmd.com/skin-problems-and-treatments/lice-treatment#3##targetText=other%20options.-,Wet%20combing%20is%20one,or%20olive%20oil.,-But%20these%20may (from chromium bug 961440)
This is explored in https://github.com/bokand/ScrollToTextFragment/issues/15#issuecomment-496595958 and we'd appreciate TAG reviewers' thoughts on this issue!
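As a rough sketch of the delimiter idea described above, the function below splits a URL string at the first ## and returns the part the page would see along with the stripped directive. The name splitDoubleHashDirective and the return shape are hypothetical. Plain string slicing is used here so the sketch makes no assumption about how a URL parser treats ##.

```javascript
// Split a URL string at the first "##" delimiter (hypothetical helper).
// pageUrl is what the page would see; directive is the part hidden from it.
function splitDoubleHashDirective(url) {
  const delimiterIndex = url.indexOf("##");
  if (delimiterIndex === -1) {
    return { pageUrl: url, directive: null };
  }
  return {
    pageUrl: url.slice(0, delimiterIndex),
    directive: url.slice(delimiterIndex + 2),
  };
}

// A page that already uses the fragment for its own state ("#3").
const withState = splitDoubleHashDirective("https://example.org/page#3##targetText=other%20options");

// A page with no fragment of its own.
const withoutState = splitDoubleHashDirective("https://example.org/##targetText=annotated%20world!");
```

In the first case the page keeps its own #3 fragment and never sees the targetText part, which is the compatibility property the delimiter is meant to provide.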
Discussed
Jul 31, 2019 (See Github)
Alice: we were waiting for... (goes through comments and summarizes them)
David: the two hash thing may actually be interesting architecturally; part of the URL that the page doesn't have access to
Peter: there's a lot of work that's been done with media fragments; seems like this should be compatible with that
Peter: I remember Doug Schepers talking about an annotation system.
Tess: There was a whole effort -- workshop, community group of some kind, which may still exist. Artefacts are still out there. I want to say the folks working on scroll to text have already looked at that; at least I remember offhand seeing they'd already seen that.
Peter: how to proceed?
Tess: Two things important: (1) adding something to URL not visible to page; media fragments didn't do that. They came up with a different syntax that is page visible. What's important about hiding from the page?
Alice: why hidden from the page? I recall seeing, but forgot.
Peter: I see concerns about pages that already use fragment ids for keeping navigation state, and they're worried about conflicting with that.
Tess: why more of a problem in this case than for media?
David: because it's a user feature? Users invoke new feature that the page doesn't expect, want to share URLs?
Alice: Explanation in https://github.com/WICG/ScrollToTextFragment/issues/15#issuecomment-506347279 . Concern about apps doing their own hash parsing. Also causes the text search part to not appear in the URL, so it would be an odd transient thing. So tradeoff between backwards compatibility for sites that do their own hash parsing, versus not changing the URL and consistency with media fragments etc.
Peter: Bigger than just consistency, adding a new delimiter to URLs. Big new feature. Something we should look at.
Alice: When would there be an actual conflict between -- when would you link to a fragment in a single page app?
David: I've seen examples... like cryptpad.
Sangwhan: Youtube for TVs.
David: Map examples -- do this with hash, not great example for text search.
Peter: How will these URLs be minted? From user doing text search in browser? Or something page will generate?
Hadley: If the page generates it on its own, isn't it then pretty close to how anchors already work. Looking at search use case, thinking of something like schema.org -- pages provide some semantic context, but heavy lifting done by crawler/indexer.
Peter: Search engine use case would definitely be the url being minted by something outside the page.
Peter: sounds like we have some issues to raise -- someone want to write them up?
Hadley: One more issue: internationalization. In security section, potential threat is exfiltrating data from destination site. One proposed mitigation to say it only matches on word boundaries. Not all languages have visible word boundaries -- cites UAX 29 supplemented by word dictionary, as done by ICU project's boundary analysis.
Peter: Thai example requiring dictionary; CSS doesn't have concept of word.
Tess: On the one hand, ICU is widely used in browser engines... though should we have specs that require a specific library, or should we have a spec for this so that we're not just depending on ICU. So we could reference a spec rather than an ICU behavior.
Peter: Somebody want to write this up?
Alice: I can try.
Peter: Bump issue a few weeks?
Comment by @dbaron Aug 7, 2019 (See Github)
We had a discussion about this at our telecon last week.
Comment by @dbaron Aug 7, 2019 (See Github)
Another question about this: what's the story for feature detectability?
Comment by @bokand Aug 8, 2019 (See Github)
Another question about this: what's the story for feature detectability?
I brought this up in WICG/ScrollToTextFragment#19 - We should have some way to detect this feature but I'm not sure what the best way would be. Since there's no new JS APIs there's no obvious place to put it. Perhaps a bool on navigator? Or on URL? Neither feels especially intuitive. Is there an existing place for feature detecting things without a matching JS/CSS API?
For ##, we could feature detect that with (new URL('https://test.com/##')).hash.indexOf('##') >= 0 but we probably don't want to tie ## to targetText.
From some of the questions in the notes:
Peter: I remember Doug Schepers talking about an annotation system.
Probably WebAnnotations? We did look at this and our syntax is quite close to the TextQuoteSelector, the major difference being that we allow essentially a wild-card match on the exact portion to allow a more compact representation of a long snippet.
Peter: there's a lot of work that's been done with media fragments;
We also looked at media fragments and initially wanted to do something similar. However, we ran into the compatibility concerns for pages that don't expect a hash/use it for their own processing.
How will these URLs be minted? From user doing text search in browser? Or something page will generate?
We expect two major use cases: external pages pointing to a sub-resource (e.g. search engine, Wikipedia references) as well as users highlighting a snippet and copying a direct link to it. In both these cases we're generating a hash for a page without it knowing about it prior. Hence why conflicting with existing uses of the hash for routing is a concern; we expect a large number of links containing a hash to pages that would previously not expect it.
The case of internal (within an origin) anchors is less interesting because it's already possible. Since the author controls both the anchor and the pointed to resource, they can simply annotate the desired location with an element-id and use a regular element-id fragment (and provide highlight styling using :target).
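The comparison with TextQuoteSelector earlier in this comment can be made concrete with a small parser sketch. The grammar used here (an optional prefix ending in "-", a start, an optional end, and an optional suffix beginning with "-", separated by commas) is an assumption inferred from the example URL quoted earlier in this thread, not a statement of the spec. The function name parseTargetText and the field names are hypothetical.

```javascript
// Parse a targetText value into prefix/start/end/suffix parts.
// The grammar is inferred from the example URL in this thread and is an assumption.
function parseTargetText(value) {
  const parts = value.split(",");
  const result = { prefix: null, textStart: null, textEnd: null, suffix: null };

  // A leading part that ends with "-" is treated as the prefix context.
  if (parts.length > 0 && parts[0].endsWith("-")) {
    result.prefix = decodeURIComponent(parts.shift().slice(0, -1));
  }
  // A trailing part that starts with "-" is treated as the suffix context.
  if (parts.length > 0 && parts[parts.length - 1].startsWith("-")) {
    result.suffix = decodeURIComponent(parts.pop().slice(1));
  }
  // What remains must be a start, optionally followed by an end.
  if (parts.length < 1 || parts.length > 2) {
    return null;
  }
  result.textStart = decodeURIComponent(parts[0]);
  if (parts.length === 2) {
    result.textEnd = decodeURIComponent(parts[1]);
  }
  return result;
}

const parsed = parseTargetText(
  "other%20options.-,Wet%20combing%20is%20one,or%20olive%20oil.,-But%20these%20may"
);
```

When both a start and an end are given, everything between them matches without being spelled out in the URL. That is the wild-card behaviour that distinguishes this syntax from TextQuoteSelector's single exact string.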
Comment by @BigBlueHat Aug 13, 2019 (See Github)
The Web Annotations WG did also create a fragment selector proposal as part of a note and there are a handful of existing implementations of those among web annotation tool providers--Apache Annotator among them (see demo).
So...Web Annotation Fragment Selectors:
https://annotator.apache.org/demo/#selector(type=TextQuoteSelector,exact=annotated%20world!))
vs.
https://example.org/##targetText=annotated%20world!
The Web Annotation Data Model (and its Open Annotation predecessor) are widely used in digital heritage communities such as those using the International Image Interoperability Framework or projects like Pelagios or online article publishing tools like dokie.li. Additionally, general purpose web annotation systems like Hypothes.is and Pundit support Web Annotation Data Model compatible exports.
All of these communities would benefit from a text selection fragment identifier which was compatible with the Web Annotation Data Model structure such that conversions could be made between the two by existing implementations.
Discussed
Aug 21, 2019 (See Github)
David: I had a note to schedule a breakout...
Alice: didn't we discuss this at our last breakout?
David: yes - I think Tess was going to write something..
Alice: let's bump it until she's back
Comment by @torgo Aug 28, 2019 (See Github)
@dbaron to schedule a breakout to discuss this week with a fall-back of discussing it at the f2f.
Discussed
Sep 4, 2019 (See Github)
David: we did talk about it last week...
Tess: yes, we did talk about it...
David: we struggled to draw conclusions. I was gonna go ask Anne what the right forum for talking about URL changes is. Anne got back to me. I haven't proxied that response yet.
David: one of the underlying questions is "who are you supposed to talk to about fundamental changes for URLs"... [So difficulties] but at the same time it seems like a decent idea.
Peter: bump to TPAC? There might be more people involved with URLs.
Alice: do we have a good distillation of the things we're struggling with about it?
David: one thing is: if you are going to change URL syntax who do you need to socialize it with?
Alice: so, yes the design seems OK but we're not sure about changing URL syntax.
Peter: for me I have concerns that micro-syntax is overloading fragment identifiers... everything is stepping on everything else...
David: that's why they are coming up with this new mechanism for a piece of the URL that is not exposed to the site so it doesn't conflict with another thing.
Peter: but is this already being used by some sites...
Alice: Yves points out that fragments need to be interpreted
David: it's a little more backwards compatible by using ##
Dan: We are kind of defenders of the URL.
Peter: f2f before tpac?
- WebTransport - @cynthia, @dbaron, @ylafon
- A toast UI element - @hober, @alice, @kenchris
- Import maps - @cynthia, @kenchris
- Portals - @hober, @cynthia, @dbaron, @kenchris, @plinss
- CSS Properties and Values API Level 1 - @hober, @alice, @dbaron
- Trusted Types - @hober, @cynthia, @alice, @hadleybeeman, @plinss
Comment by @dbaron Sep 4, 2019 (See Github)
This looks like an interesting proposal, and one that addresses an important use case that I've wished for a long time would be better. One of our concerns is that it is making a pretty fundamental change to URL syntax, and both we and others need to review that change carefully.
The motivations for wanting to change URL syntax (having a part of the URL that isn't exposed to the page in any way, so that it doesn't break existing uses of the fragment, but can be shared by systems that link to the page) seem sound to me. The question is whether it's worth the cost of changing URL syntax.
I think this is something that we need to think about a bit more (given the TAG's involvement with URLs) but that probably also needs to be socialized more widely. One place to do so might be uri@w3.org; there are likely others. It also wouldn't surprise me if some of the reaction there is quite critical.
Comment by @bokand Sep 4, 2019 (See Github)
Thank you for the feedback. I agree a change to URL syntax is a big deal and warrants additional scrutiny. I've already filed a bug on whatwg/url#445 (and related whatwg/html#4868) to try to get some visibility, I'll email uri@w3.org as well. Please let me know if there's other places I should tap.
One alternative we could consider and would appreciate thoughts on the tradeoffs:
## as a delimiter is invalid in a URL today which means tools and apps could misinterpret these new URLs. Practically speaking, they've worked in all browsers and tools I've tried so far but that's a necessarily limited sample; I can't say with certainty what the compat impact of this change would be. There's some experience here from Fragmentation that there were apps that failed on such URLs, though I haven't seen the examples myself.
We could pick an alternative but valid delimiter (e.g. @@, ::, etc.) to use in a fragment for the fragment directive. For example: https://example.org#@@targetText=foo. This is a valid URL as per today's URL spec; the changes would be entirely in the HTML spec in how the fragment is interpreted and the URL mutated when loading a document (we'd still strip off @@targetText=foo).
The trade off here is that, because these would be valid URLs, we might be affecting legitimate web apps that already use this character sequence in their fragment. This may or may not be preferable to affecting URL parsing apps. One advantage here is that it's easier for us to measure the compat impact on web pages: we can add some telemetry and measure how often we see URLs with various candidate delimiters and pick one with acceptably low usage (assuming it exists).
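The telemetry idea in the previous paragraph could look roughly like the sketch below, which counts how many URLs in a sample already contain each candidate delimiter in their fragment. The function name countDelimiterUsage and the sample URLs are hypothetical. Real telemetry would run inside the browser rather than over a list of strings.

```javascript
// Count, per candidate delimiter, how many sampled URLs already use it in their fragment.
function countDelimiterUsage(urls, candidates) {
  const counts = Object.fromEntries(candidates.map((delimiter) => [delimiter, 0]));
  for (const url of urls) {
    const hashIndex = url.indexOf("#");
    if (hashIndex === -1) continue; // no fragment, nothing to conflict with
    const fragment = url.slice(hashIndex + 1);
    for (const delimiter of candidates) {
      if (fragment.includes(delimiter)) counts[delimiter] += 1;
    }
  }
  return counts;
}

// Hypothetical sample of URLs seen in the wild.
const sample = [
  "https://a.example/#section",
  "https://b.example/#/route::detail",
  "https://c.example/page",
  "https://d.example/#user@@home::x",
];
const usage = countDelimiterUsage(sample, ["@@", "::", ":~:"]);
```

A delimiter with a count at or near zero across a large sample would be the safer choice, since existing pages are unlikely to depend on it.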
Comment by @hober Sep 11, 2019 (See Github)
Comment by @hober Sep 11, 2019 (See Github)
We'll follow the rules set out in Unicode Standard Annex #29 supplemented by a word dictionary as done by the ICU Project's boundary analysis.
I'm a bit worried about interop here. Are you effectively requiring all browser engines to use ICU? If not, how can we ensure interoperability in a world where the word dictionary isn't specified / standardized? This is especially concerning given that the quality of word boundary analysis directly affects the efficacy of a security mitigation in this spec.
Ideally, a specific dictionary would be specced somewhere & normatively referenced here. If that's not feasible, I think this spec needs to set out robust normative requirements on the word dictionaries implementations may choose.
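To make the interop concern concrete, here is a minimal sketch of word-bounded matching built on Intl.Segmenter, which exposes the engine's own word segmentation. The function name findWordBoundedMatch is hypothetical, and this is not the algorithm from the spec. Because the segmentation data comes from the engine, two engines with different dictionaries could return different results for the same input, which is the problem raised above.

```javascript
// Return the index of the first occurrence of query that starts and ends
// on a word boundary reported by the engine's segmenter, or -1 if none.
function findWordBoundedMatch(text, query, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });

  // Collect every boundary position: the start of each segment plus the end of the text.
  const boundaries = new Set([text.length]);
  for (const segment of segmenter.segment(text)) {
    boundaries.add(segment.index);
  }

  let from = 0;
  while (true) {
    const index = text.indexOf(query, from);
    if (index === -1) return -1;
    if (boundaries.has(index) && boundaries.has(index + query.length)) {
      return index;
    }
    from = index + 1; // keep searching after a non-word-bounded hit
  }
}

// "cat" inside "category" is skipped; the standalone word is matched.
const matchIndex = findWordBoundedMatch("category cat", "cat");
```

For space-delimited text the boundaries are fairly uncontroversial. For languages without visible word separators, the result depends on the dictionary behind the segmenter.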
Comment by @hober Sep 11, 2019 (See Github)
I've marked this as pending external feedback as we wait for the URL, HTML, and URI github & email conversations to conclude.
Comment by @nickburris Oct 8, 2019 (See Github)
Based on URI feedback, we've updated our proposal to use a fragment directive delimiter that is valid by URL spec, :~: instead of ##. This means we only need to amend HTML spec to allow :~: to indicate the fragment directive.
I'm a bit worried about interop here. Are you effectively requiring all browser engines to use ICU? If not, how can we ensure interoperability in a world where the word dictionary isn't specified / standardized? This is especially concerning given that the quality of word boundary analysis directly affects the efficacy of a security mitigation in this spec.
We improved our spec on word boundary matching based on your feedback and discussion with @domenic: https://wicg.github.io/ScrollToTextFragment/#next-word-bounded-instance
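Because :~: is made of characters that are already valid in a fragment, a standard URL parser passes it through unchanged, and the stripping step can sit entirely on the consuming side. The sketch below assumes the directive is everything after the first :~: in the fragment. The function name stripFragmentDirective and the return shape are hypothetical.

```javascript
// Parse with the standard URL API, then remove the directive from the fragment.
function stripFragmentDirective(urlString) {
  const url = new URL(urlString);
  const delimiterIndex = url.hash.indexOf(":~:");
  if (delimiterIndex === -1) {
    return { pageUrl: url.href, directive: null };
  }
  const directive = url.hash.slice(delimiterIndex + 3);
  const remaining = url.hash.slice(0, delimiterIndex);
  // If only the "#" is left, clear the fragment entirely.
  url.hash = remaining === "#" ? "" : remaining;
  return { pageUrl: url.href, directive };
}

const stripped = stripFragmentDirective("https://example.org/page#intro:~:targetText=foo");
const bare = stripFragmentDirective("https://example.org/#:~:targetText=foo");
```

The page-visible URL keeps any fragment the page itself relies on (here #intro), and no change to URL parsing is needed to get there.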
Comment by @yoavweiss Oct 10, 2019 (See Github)
@hober - can you see if the word boundary changes address the interoperability concerns you raised?
Comment by @bokand Oct 10, 2019 (See Github)
It was pointed out in the I2S that we never resolved @annevk's point above.
I couldn't find a description of how "Restricted to pages without an opener (no window.open)" is managed. (In particular, if A1 opens a popup A2 which then navigates A1 to V, V won't have an opener, but we certainly don't want this to work there.)
Sorry about that, pasting my recent reply from there:
Apologies, we did go over this internally with our security reviewers but I forgot to reply on the thread. The outcome was that we consider this one of several mitigations, rather than a hard security boundary. Given that this means a popup is visible, and the attacker would need to phish user gestures, and they can only search on word boundaries, and they would still need some exploit to determine a cross-origin scroll, we felt that this wasn't concerning enough to add a ton of complexity to lock down further.
Comment by @annevk Oct 11, 2019 (See Github)
The concern would more be that by scrolling the page they can ensure important information is not visible in the viewport and might get the user to do something they did not plan on doing. (I realize this is more of a general problem with this idea and not specific to popups.)
Comment by @bokand Oct 11, 2019 (See Github)
But that's already possible using fragment id's right? i.e. Just find any element id that's away from the important information and link to that (most pages have ids).
Comment by @annevk Oct 11, 2019 (See Github)
With a lot less granularity, sure.
Comment by @bokand Oct 15, 2019 (See Github)
FYI: I've made public our security review doc which lists the threats that we've considered and some of our reasoning around the design: https://docs.google.com/document/d/1YHcl1-vE_ZnZ0kL2almeikAj2gkwCq8_5xwIae7PVik/edit#
Comment by @annevk Oct 17, 2019 (See Github)
Did you consider only allowing this cross-origin if the link/popup has noopener semantics?
Comment by @nickburris Oct 17, 2019 (See Github)
I think restricting to cross-origin in any case would inhibit serving same-origin text references, which is a strong use case of the feature.
Comment by @nickburris Oct 17, 2019 (See Github)
Sorry, I misread your comment. Yes, one of the security restrictions is that we restrict to "top-level browsing contexts without an opener" i.e. we check that window.opener is null, even for same-origin. Does this cover the case you're referring to? Or is there a case where window.opener is null but the noopener attribute wasn't specified?
Comment by @annevk Oct 17, 2019 (See Github)
Comment by @bokand Oct 25, 2019 (See Github)
I've spun the discussion related to popups and openers off into https://github.com/WICG/ScrollToTextFragment/issues/64 to avoid cluttering this issue. Please follow along in there if interested.
Comment by @bokand Nov 28, 2019 (See Github)
Ping: @hober - I think we've addressed the concerns around URLs, the processing has now been moved entirely into HTML document loading so there's no URL related changes.
Also, as @nickburris mentioned above, we've also done some work to make the word boundary requirements a little more detailed and specified. I don't think there's anything in this review that's still blocked.
Comment by @hober Dec 4, 2019 (See Github)
I wrote:
I'm a bit worried about interop here. Are you effectively requiring all browser engines to use ICU? If not, how can we ensure interoperability in a world where the word dictionary isn't specified / standardized? This is especially concerning given that the quality of word boundary analysis directly affects the efficacy of a security mitigation in this spec.
@nickburris replied:
We improved our spec on word boundary matching based on your feedback and discussion with @domenic: https://wicg.github.io/ScrollToTextFragment/#next-word-bounded-instance
This is definitely an improvement, thanks. I have more specific thoughts about the non-normative note beginning "Limiting matching to word boundaries," which are below. With the following changes, I'd be happy with this text.
Limiting matching to word boundaries is one of the mitigations to limit cross-origin information leakage.
It makes sense for this to be a non-normative note, though not as a note under step 4 of the algorithm. Consider moving?
A word boundary is as defined in the Unicode text segmentation annex.
This should be normative text, not in a note, and it should be somewhere other than step 4 of the algorithm.
The Default Word Boundary Specification defines a default set of what constitutes a word boundary, but as the specification mentions, a more sophisticated algorithm should be used based on the locale.
Dictionary-based word bounding should take specific care in locales without a word-separating character (e.g. space). In those cases, and where the alphabet contains fewer than 100 characters, the dictionary must not contain more than 20% of the alphabet as valid, one-letter words.
This contains normative requirements, and thus should be in normative text, somewhere other than step 4 of the algorithm.
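The one-letter-word requirement quoted above is concrete enough to check mechanically. The sketch below is one reading of it: for alphabets of fewer than 100 characters, at most 20% of the letters may appear in the dictionary as one-letter words. The function name meetsOneLetterWordLimit and the input shapes (an array of letters and a Set of words) are hypothetical.

```javascript
// Check the quoted limit: in an alphabet of fewer than 100 characters,
// no more than 20% of the letters may be valid one-letter words.
function meetsOneLetterWordLimit(alphabet, dictionary) {
  const letters = new Set(alphabet);
  if (letters.size >= 100) {
    return true; // the quoted rule only covers alphabets under 100 characters
  }
  let oneLetterWords = 0;
  for (const letter of letters) {
    if (dictionary.has(letter)) oneLetterWords += 1;
  }
  // Integer form of oneLetterWords / letters.size <= 0.2
  return oneLetterWords * 5 <= letters.size;
}

// Hypothetical ten-letter alphabet with two one-letter words (exactly 20%).
const tenLetters = "abcdefghij".split("");
const withinLimit = meetsOneLetterWordLimit(tenLetters, new Set(["a", "b", "ab"]));
const overLimit = meetsOneLetterWordLimit(tenLetters, new Set(["a", "b", "c"]));
```

The point of the limit is to stop a dictionary from turning nearly every character into its own word, which would let an attacker probe the page one character at a time.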
Comment by @plinss Dec 4, 2019 (See Github)
The spec needs to list which content types this type of fragment applies to, e.g. does this only work for html? what about text/plain documents? SVG? etc...
I'm not asserting which types of documents this should apply to, but the spec needs to be clear as fragments are interpreted according to the content type of the document.
Comment by @hober Dec 4, 2019 (See Github)
@annevk left a comment on WICG/ScrollToTextFragment#70 after it was merged:
That might be doable, yes, but it would require patching HTML.
Is there an issue filed on HTML to track this work, @bokand?
Comment by @hober Dec 4, 2019 (See Github)
Marking as pending external feedback while we wait for answers to the above questions, and pending editor update for my word boundary nits.
Comment by @bokand Dec 5, 2019 (See Github)
This is definitely an improvement, thanks. I have more specific thoughts about the non-normative note beginning "Limiting matching to word boundaries," which are below. With the following changes, I'd be happy with this text.
Thanks for the feedback, we'll make the changes.
The spec needs to list which content types this type of fragment applies to, e.g. does this only work for html? what about text/plain documents? SVG? etc...
It's implied to be HTML documents only by section 2.3.4 allowTextFragmentDirectiveFlag since it makes the change in the HTML document loading steps. I agree it'd be useful to make this more explicit. Would a non-normative note be sufficient because of the above or does this need to be normative?
@annevk left a comment on WICG/ScrollToTextFragment#70 after it was merged:
That might be doable, yes, but it would require patching HTML.
Is there an issue filed on HTML to track this work, @bokand?
I believe this was in reply to my suggestion:
Also, w.r.t. target="_self" rel="noopener", could we make a link with a text directive imply noopener semantics? I would think this is preferable to requiring links to add noopener since there are cases where that'll be difficult for the user and generally make this more difficult to use.
I just replied on the PR. I changed my mind on the above, I don't think it's critical to this proposal (details there). I think @annevk's original idea about target="_self" rel="noopener" would still be useful. Filed https://github.com/whatwg/html/issues/5134.
Comment by @bokand May 8, 2020 (See Github)
FYI: There's a question in whatwg/#5523 about whether we should move window.location.fragmentDirective from window.location to elsewhere (document.fragmentDirective?) due to the quirks of window.location. I'd appreciate if we got broader opinions on whether it's worth the work to make window.location smoothly extensible or if it should be avoided in new APIs.
Comment by @annevk May 9, 2020 (See Github)
I don't think it's critical to this proposal (details there).
I replied there pointing out a flaw in the reasoning, but that was never followed up on.
Comment by @bokand May 12, 2020 (See Github)
Sorry - I still intend to get to it (along with a bunch of other outstanding issues) but have too many balls in the air. I'll carve out some time this week to look at that issue specifically.
Comment by @hober May 27, 2020 (See Github)
The spec needs to list which content types this type of fragment applies to, e.g. does this only work for html? what about text/plain documents? SVG? etc...
It's implied to be HTML documents only by section 2.3.4 allowTextFragmentDirectiveFlag since it makes the change in the HTML document loading steps. I agree it'd be useful to make this more explicit. Would a non-normative note be sufficient because of the above or does this need to be normative?
I think you'll need to make at least some normative statements, yes, though perhaps most of this point could be captured in a non-normative note.
Comment by @hober May 27, 2020 (See Github)
FYI: There's a question in whatwg/#5523 about whether we should move window.location.fragmentDirective from window.location to elsewhere (document.fragmentDirective?) due to the quirks of window.location.
FWIW, I agree with @annevk's comment here: https://github.com/whatwg/html/issues/5523#issuecomment-626128421
Comment by @hober May 28, 2020 (See Github)
We took another look at this in this week's TAG F2F and we've decided to complete our review.
We're happy that you've tightened up the normative text around word boundaries that we were concerned with, and that you reached out to several other venues to solicit feedback (e.g. uri@w3.org).
We're (still) worried that this is a very significant change to URL syntax and processing that may not have gotten sufficient buy-in from the various relevant standards bodies and other stakeholders. We're also worried that this doesn't have a clear path to standardization. Where do you intend to take this after WICG?
Please file a new design review issue if you end up significantly altering your plans here. Thanks!
Comment by @bokand Jun 3, 2020 (See Github)
Thank you @hober and TAG for the review and constructive feedback!
We're (still) worried that this is a very significant change to URL syntax and processing
I'd just like to get clarification on this point - our original idea of mucking with URLs themselves was dropped. The proposal as it stands is entirely a change to fragment processing in HTML documents only. There are some significant changes here but this seems less scary than changes to URLs so I'd just like to confirm this comment is referring to the most up-to-date spec.
We're also worried that this doesn't have a clear path to standardization. Where do you intend to take this after WICG?
The current spec is monkey-patching HTML so I think it makes sense to move there if/when we can get additional implementer interest.
Comment by @lilles Nov 23, 2020 (See Github)
In addition, Blink now has an intent to ship support for the ::target-text selector as specified in css-pseudo which supports styling the text fragment highlight.
Comment by @bokand Nov 23, 2020 (See Github)
We're (still) worried that this is a very significant change to URL syntax and processing
I'd just like to get clarification on this point - our original idea of mucking with URLs themselves was dropped.
@hober A ping on clarifying this point; this is entirely specified as HTML fragment processing. The spec is now much clearer about how processing the fragment directive works and what it means for various edge cases to do with URLs.
Opened Jul 2, 2019
Hello TAG!
I'm requesting a TAG review of:
Further details:
You should also know that...
One of our major discussion points currently is how the targetText= indicator should be delimited in the URL fragment. See https://github.com/bokand/ScrollToTextFragment/issues/15. The latest idea here is that we could use a double-hash syntax (e.g. example.com#fragment##targetText=example) to avoid breaking websites that use the fragment for routing/state. The browser would parse the ##targetText= identifier and then remove it from the fragment.