#619: Early design review: Sanitizer API

Visit on Github.

Opened Mar 23, 2021

I'm requesting an early design review of the Sanitizer API

Provide a browser-maintained "ever-green", safe, and easy-to-use library for user input sanitization as part of the general web platform.

Further details:

We'd prefer the TAG provide feedback as:

Discussions

Discussed Mar 29, 2021 (See Github)

Lea: I looked yesterday - so first off lots of use cases. It's work that should be happening. Sensible defaults. Even though they have a detailed dictionary - there are some things not expressed through that. For example allow lists and block lists - but what about javascript: links on <a> elements. Strange to take a string as an input and return a document but I can see the reasoning. .. Unclear how someone can remove a blocked element through the default list. Not clear how you can drop one element from the allow list without specifying the entire allow list again. Noticed that the config not exposed on the sanitizer instance... That could help with customizing the default config.

Hadley: why would you want to do this?

Lea: you want users to be able to write arbitrary HTML - but this opens up xss security issues. There is a library called dompurify which is widely used that they are basing their work on. It would be good to have something like this in the browser.

[we can come back to it in the plenary]

Comment by @LeaVerou Mar 29, 2021 (See Github)

I'm happy to see work in this direction. This is so frequently needed, and requiring authors to seek and include a library to provide basic security for any app that handles HTML was not ideal.

I took a look at the proposed API and will outline my thoughts below.

I like the sensible defaults approach, and the fact that "simple things should be easy" is an explicit goal. It's good that there is an optional config to customize these defaults, to cater to more specialized use cases. However, I noticed that even though there's a very long default config, it does not describe or afford customization of all rules required to properly sanitize HTML. For example, neither <a> elements, nor href attributes are dropped, but javascript: URLs need to be removed, which I imagine is part of the default sanitization. However, there is no way to specify or remove this behavior in the config, or any behavior relating to values of attributes. It would be good if the config could expose and afford customization on all sanitization rules. This could also allow for a more granular way to avoid DOM clobbering than stripping all ids. Perhaps this is what you mean by additional configuration options for 2.0?

It is unclear how authors can remove an element from the list of allowed elements, without re-specifying the entire allowlist of the default configuration. My understanding is that if they specify it in a block list, the behavior of the sanitizer changes to "anything goes except the blocklist". Exposing the config on Sanitizer instances (possibly read-only) could provide an easy solution for this use case. Exposing the default config as a static property on Sanitizer might be helpful too.

While I understand the reasons behind making the main sanitize() method return a DocumentFragment, it does feel a little unexpected from a DX perspective. Also, given that sanitizeToString() is merely a helper for serializing document fragments, I wonder if it would be better for the Web Platform if a property/method was added to DocumentFragment for serialization, instead of defining ad hoc helpers on more specialized APIs.

Comment by @otherdaniel Apr 1, 2021 (See Github)

Hi Lea, Thank you for the feedback! A quick round of replies:

javascript:-URLs and expressiveness: This is an open issue. My current thinking -- reflected in the current spec draft -- is that simplicity is key to get web devs to use the Sanitizer API and that consequently we should try hard to keep this simple and usable and to cover the basics first. This is corrobated by several authors from exisiting sanitizer libraries, who warned us that we'd almost instantly get into a "feature race", and that it's worth clamping down on feature requests in order to keep overall complexity at bay.

The way the current spec draft handles this is:

  • There are no ways to specify content-dependent sanitization. Elements or attributes are allowed, or blocked, but the config provides no way to be more granular.
  • There are a handful of cases that don't fit into this framework. In these cases, we'll add special case handling to the spec. This is a little unsatisfying, because this creates "magic" behaviour that a user couldn't re-create with a config.
  • You mention javascript:-URLs, which are the perfect example. This is handled in handle funky elements (rules 2-4).

Several proposals for more expressive configs are discussed in https://github.com/WICG/sanitizer-api/issues/26 . We marked this issue as "v2".

"unclear how authors can remove an element from the list of allowed elements": This is true, I'm afraid. The idea is that authors can either supply their own allow list (which makes most sense if they have a specific goal in mind, e.g. only formatting), or they supply a block list (with which they can effectively eliminate individual elements from the defaults). So I think the capability is there, but it's arguably laborious to discover it.

Exposing the default config is a good idea; we should do that.

Exposing a Sanitizer's config is something we've had early on, but then removed it for lack of a clear use case. We should reconsider this.

general readability: This is arguably a follow-on to the previous point. But I think this is also touched in the "very long default config" thought. I think of this as an editorial problem: The problem I'm having as spec editor is that I'm trying very hard to be unambigous, and to emulate the "algorithms" writing style of the HTML family of specs. This does, unfortunately, make the spec very dense and quite hard to read. This has come up repeatedly. For the specific audience of "spec imlementors" this is a good choice; but arguably not for anyone else.

I'm a bit at a loss of how to improve this. My current thinking is to continue specifying this as we currently do and to assume spec implementors are the primary audience of the spec, and to provide more & better additional, developer-focused documentation seperately.

If you know of any specs that do a particularly good job of providing dense & precise information, while still remaining very accessible to multiple audiences, I'll gladly take a look.

sanitizeToString vs. serializing a DocumentFragment directly: I agree. I have no experience with those parts of the spec universe, though, so I'm not sure whether this is easy to enact.

Comment by @otherdaniel Apr 21, 2021 (See Github)
  • We have added a feature for configuration introspection. (WICG/sanitizer-api#77).
  • We have also added prose to explain how we arrived at the built-in constants (WICG/sanitizer-api#78), although the constants remain as the normative text.
Discussed May 1, 2021 (See Github)

Lea: we've given feedback, they've made some changes and explained why they can't address others. I think we're done here. Tess?

Also, sanitize to string vs sanitize: they have two methods, and they've received criticism for that. We pointed out that a document fragment serialization algorithm doesn't belong in this disucssion. So I opened an issue in WHATWG. I think it's stalled, but there wasn't push back against it.

Hadley: do we have danger if we bake the algorithm into the spec? With crypto, we've said "expect algorithns to evolve faster than the spec"

Tess: Yes and no. the editors of the HTML spec will have to be vigilant and keep this up to date, and browser engineers to be vigilant and implement those changes. But that's better than the status quo. Sanitisation is done by JS libraries, which may/may not be maintained, on sites that may/may not be maintainted. So this probably a win over the status quo.

Lea: Agreed. In most cases, it would be tweaking the default config.

Tess: right. Suppose some broswer impelenents a new element befoer others, or suppose a browser engine has implementations of non-standard elements that aren't exposed to the web but are exposed to other clients. Ex: webkit has <attachment>, which is not exposed to the web. It's used for mail clients, to signal the web view.

In webkit, I would expect the default block list for the sanitiser to include <attachment> when that element is enabled.

Lea: Does their spec allow for default config?

Tess: it's not a browser. The mail client would have this turned on (add attachment to the blacklist), and would be great client of this sanitizer API. If you think about parsing a mail message itself, you're parsing the hierarchy of MIME parts. Each step of THAT is untrusted. People try to do buffer overruns in mail clients by having weird names for attachments.

This seems fine.

I agree that sanitiser strings seems misplaced, if we want to way to serialise a document fragment, we should just have one. And you'd just sanitise the string by calling this.

Lea: Re UAs supporting proprietary elements and needing to add to config... could this apply to browsers? Would it be helpful to allow UAs to extend this default config?

Tess: I think UAs should be allowed to extend the default config mostly because... I think we want the default config to be safe. And authors to have to opt in to less safety. But you could imagine a browser like Brave might want one even stricter than default.

Lea: or the user to customise it

Tess: sure.

Browsers are free to violate specs, and if they do in practice we can update the spec to match that reality.

We shouldn't add it unless we need it.

hadley: what definition of funky?

Tess: I'm proud that they have a definition of handleFunkyElement. hyperlinks with js URLs, form elements whose action element has a js url, and input/button elements that have a form action attribute with a js url.

Hadley: 'funky' is not a very helpful descriptor, may not be useful for people outside the group.

Lea: I don't have a problem with 'funky'. My main issue is that it talks about elements but 75% of it is attributes.

Tess: moving this functionality into the browser is much better, because the browser is more likely to be aware of new elements in HTML. We added the template element not long ago, and there are lots of sanitiser libraries that don't handle it. I'm inclined to say thanks for doing this important work, let us know how it goes.

Lea: I'll comment and propose closing.

Comment by @LeaVerou May 12, 2021 (See Github)

@hober, @hadleybeeman and I looked at this during a breakout today.

We are glad to see the recent improvements based on earlier feedback. Note that I wasn't saying that the "very long default config" wasn't readable, just that the longer it is, the more tedious it would be for authors to replicate it. However, now that it's exposed, that wouldn't be a problem.

We all thought that Document Fragment serialization should be exposed via a more general method instead of residing in this API. I did open an issue on this a while ago, and it is ongoing.

Overall, we're happy with the direction this is going, and are considering closing this review. Thank you for doing this very valuable and highly needed work!

Comment by @mozfreddyb May 17, 2021 (See Github)

Thank you for concluding this review! Just one quick question..

We all thought that Document Fragment sanitization should be exposed via a more general method instead of residing in this API. I did open an issue on this a while ago, and it is ongoing.

Am I right to assume this is a typo and you were saying that Document Fragment serialization should ideally be exposed someplace else?

Comment by @LeaVerou May 17, 2021 (See Github)

Yes, fixed now, thanks!