#282: TextEncoderStream and TextDecoderStream

Visit on Github.

Opened May 22, 2018

Hello TAG!

I'm requesting a TAG review of:

Further details (optional):

You should also know that...

This is an extension to the existing Encoding Standard to integrate with the Streams Standard. There has been extensive discussion at https://github.com/whatwg/encoding/issues/72.

We'd prefer the TAG provide feedback as (please select one):

  • open issues in our Github repo for each point of feedback
  • open a single issue in our Github repo for the entire review
  • leave review feedback as a comment in this issue and @-notify [github usernames]

Discussions

Comment by @slightlyoff Oct 30, 2018 (See Github)

Hey Adam,

Thanks for filing for a review, and apologies for the delay in getting back to you.

Some questions regarding the merged PR:

Regards

Comment by @annevk Oct 30, 2018 (See Github)

To answer the first question, other encodings don't have a BOM. If you encoded U+FEFF with, e.g., gb18030 it should never be removed.

Comment by @slightlyoff Oct 30, 2018 (See Github)

Thanks, @annevk! Is there an equivalent mechanism to handle potentially special characters in other encodings that could be pluggable?

Comment by @annevk Oct 30, 2018 (See Github)

No such characters exist to my knowledge and we intentionally don't want text encodings to be generic, we want everyone to migrate to UTF-8.

Comment by @ricea Oct 30, 2018 (See Github)

The GenericTransformStream mixin creates readable and writeable properties, but it doesn't appears as though internal aspects of these default streams are configurable. That is, the various flags one can pass to Stream constructors aren't available for configuration when instantiating one of these stream types. This makes certain built-in stream types "magically" faster and perhaps more flexible than userland stream types, violating layering expectations. Perhaps there's a way to configure them that I've just missed?

I'm not sure I understand the question correctly. They're not magically faster. Implementations can take advantage of the fact that the internals are not visible to user JS to optimise these operations. But they wouldn't be faster because of features unavailable to JS; they'd be faster because of the absence of the dynamic behaviour provided by JS.

They're also not more flexible. A transform stream is just a readable and a writable glued together, and that is what GenericTransformStream is trying to express. That glue can be written in JS or anything else.

The polyfill at https://github.com/GoogleChromeLabs/text-encode-transform-polyfill has full feature parity with Chrome's implementation, which I think demonstrates that layering has not been broken.

The queuing strategies with which the streams are constructed are intentionally not configurable. To avoid adding buffer bloat to pipes, I made the queues as small as possible. My philosophy is that queuing strategies should be designed together with the transformer, as mismatches can have surprising side-effects.

I'm not able to locate much in the way of example code outside of the encoding/streams/ WPT tests. There isn't a ton in the Explainer either. It'd be helpful to see examples that show problems being solved rather than just "how to use it here".

The first three demos at https://streams.spec.whatwg.org/demos/ use a polyfill of this functionality. Unfortunately it is out-of-date and doesn't match the final API. I am planning to update the demos to use the new polyfill, after which I can link them from the explainer.

It would good to have a longer example in the standard as well, maybe with the other long examples at the start of section 7. @annevk, what do you think?

The Explainer states that it's a non-goal to provide encodings other than UTF-8. Is it possible to create your own subtype to handle other encodings? If not, why not?

You can create a TransformStream for any encoding you want, but you have to provide your own implementation. Encoding to encodings other than UTF-8 is intentionally not exposed to the web platform. It wouldn't be a subtype of TextEncoderStream, but I don't know why you'd want to do that anyway.

Comment by @annevk Oct 30, 2018 (See Github)

(I'm always supportive of more examples being added.)

Discussed Nov 1, 2018 (See Github)

Alex: argument is that pieces are welded shut because that's a good thing. i'd like to see the data. I'll respond

Kenneth: [related to]... transform streams?

Alex: it's all very wired shut

Kenneth: the transport stream is more like an interface.

Alex: it's not clear that you'll get the same behavior - generating new streams out of existing streams... what s is the delegation mechanism...

Yves: what is the default encoding of streams in javascript?

Alex: ...

Alex: you can't just make a transform stream...

[...discussion on what feedback to leave...]

Alex: https://streams.spec.whatwg.org/#is-transform-stream

David: if there are issues with transform streams it's worth raising them.

Alex: i'd like to make this less magic. Will leave feedback

Comment by @slightlyoff Nov 28, 2018 (See Github)

Thanks all, and sorry for the slow replies.

@ricea:

So pulling on theGenericTransformStream thread and configurability, there's a lot of stuff that these designs inadvertantly expose:

  • the association with the implicit transformer means that something that's participating in this hierarchy has to somehow implement the magical isTransformStream. This type can't be meaninfully subclassed in script. Why is this hierarchy welded shut? What are the extension points to make a BYO transform stream type?
  • the choices you're making about buffer sizes might be right, but sophisticated users might take different views. Where are the extension points for those?
  • Not having configurable queueing strategies might, again, be the right default choice. But always? In all cases? What about for other types?
  • Not having up-to-date example code is a pretty bad smell. Can you ping this issue again when things are updated, preferably in the Explainer?

Regards

Comment by @domenic Nov 28, 2018 (See Github)

the association with the implicit transformer means that something that's participating in this hierarchy has to somehow implement the magical isTransformStream. This type can't be meaninfully subclassed in script. Why is this hierarchy welded shut? What are the extension points to make a BYO transform stream type?

This isn't correct. You can easily subclass TransformStream; try it in your browser to see! The hierarchy is open to anyone to extend.

There's nothing magical about IsTransformStream. It tests whether you are a TransformStream or subclass of TransformStream. Furthermore, it is only used within the getters of TransformStream itself, to ensure you can't do something like Object.getOwnPropertyDescriptor(TransformStream, "readable").get.call(someRandomObject) and have that poke at the internal memory addresses a browser uses to store the [[readable]] internal slot of TransformStream objects, offset from an arbitrary object address. That would indeed be bad to allow.

Comment by @ricea Nov 29, 2018 (See Github)

This type can't be meaninfully subclassed in script.

By "this type", do you mean TransformStreamDefaultController? It can't be meaningfully subclassed, because it is only useful when it is associated with a TransformStream object.

What are the extension points to make a BYO transform stream type?

A transform stream is simply an object that has a readable property that is a ReadableStream and a writable property that is a WritableStream. You can create one however you like. The TransformStream constructor provides one convenient method to do so.

the choices you're making about buffer sizes might be right, but sophisticated users might take different views. Where are the extension points for those?

You can always add buffering to a pipe using identity transforms. For example, suppose you want a megabyte of buffering in front of a TextDecoderStream:

const bufferingIdentityTransform =
    new TransformStream({}, new ByteLengthQueuingStrategy({highWaterMark: 1024 * 1024}));

response.body
  .pipeThrough(bufferingIdentityTransform)
  .pipeThrough(new TextDecoderStream())
  .pipeTo(mySink);

Not having configurable queueing strategies might, again, be the right default choice. But always? In all cases? What about for other types?

There might be other platform transforms for which a configurable queuing strategy would make sense. There would be a price to pay in that the user-supplied size() function would be called inside the implementation of the transform, precluding some optimisations and introducing risks of re-entrancy. For some transforms that trade-off might be worthwhile.

Not having up-to-date example code is a pretty bad smell.

Indeed. I will let you know when they are updated.

Discussed Dec 1, 2018 (See Github)

Kenneth: Looking at issues around typing and TransformStreams.

Sangwhan: some people have issues with UTF-8 being the only encoding.

Yves: last telcon, we discussed the fact that natively js is ucs2 not utf8 - so the current name might be misleading

Kenneth: up-to-date example code?

Alex: no constructor so can't subclass?

Kenneth: domenic says you can

Alex: still an issue with subclassability - have followed up on the issue

Comment by @slightlyoff Dec 4, 2018 (See Github)

Thanks for the note on subclassing TransformStream, @domenic! Glad to see it works; I was more concerned with TransformStreamDefaultController, tho. Do you think the arguments here around not enabling subclassing of it hold water?

@ricea: yes, I was surprised to see:

class TransformStreamDefaultController {
  constructor() // always throws
  ...

Given what you've shown above re: identity streams, what's the argument for preventing subclassing?

On updated, examples, thanks. Looking forward to it.

Comment by @ricea Dec 5, 2018 (See Github)

Given what you've shown above re: identity streams, what's the argument for preventing subclassing?

A TransformStreamDefaultController and a TransformStream are mutually associated for the lifetime of the TransformStream. TransformStreamDefaultController objects are only created during construction of TransformStream objects. A TransformStreamDefaultController that was not associated with a TransformStream would be useless. @domenic wrote more about the principles of the design in The Revealing Constructor Pattern.

Suppose hypothetically that we permitted constructing a "detached" TransformStreamDefaultController. We could add a way to pass it into the TransformStream constructor to have it "attached" to the newly-constructed TransformStream. What extra capabilities would this afford the web developer? They could subclass TransformStreamDefaultController and override enqueue(), error() and terminate() methods. However, they are the only person who will be calling enqueue(), error() and terminate(). They control all the call sites, so changing the call sites would be a much simpler way to accomplish the same thing. If they didn't want to change the call sites, they could wrap the controller object passed to their start() method, or modify the controller object directly.

So preventing subclassing reduces complexity with no loss of generality.

Comment by @kenchris Dec 11, 2018 (See Github)

@domenic if you are supposed to subclass TransformStream shouldn't the examples in the stream spec do that? eg. https://streams.spec.whatwg.org/#example-ts-lipfuzz

Comment by @domenic Dec 11, 2018 (See Github)

You're not supposed to; you can. Sometimes it's useful, like when you want to add extra properties and methods! In those examples it's not, because we just want to transform some data.

Discussed Jan 1, 2019 (See Github)

Kenneth: alex's comment about subclassing / not subclassing. They say you should subclass...

Alex: I'll provide feedback today

Discussed Jan 1, 2019 (See Github)

Peter: currently pending external feedback.

Travis: Have an outstanding 'subclassing' issue. Should probably wait to the F2F.

Comment by @slightlyoff Jan 8, 2019 (See Github)

The argument that "we can't figure out why a web developer would ever want something" (a something that works against our design principles and general design goals) is not particularly compelling. The point of extensibility is to enable developers to do the set of things we haven't yet thought of, rather than to presume that we're omniscient.

The argument from consistency, at least, trumps one-off restrictions and design oddities. Let's make this subclassable.

Comment by @domenic Jan 8, 2019 (See Github)

I think you are over-applying a general principle, which doesn't make sense in the case of the revealing constructor pattern. Those revealed capabilities aren't really a proper "class"; they're a holder for some capabilities of another class (in this case TransformStream), which itself is subclassable and extensible in the desired way.

This ties in to how you have misquoted @ricea's argument. It isn't "we can't figure out why a web developer would want to do something". It's "preventing subclassing reduces complexity with no loss of generality": i.e., anything a web developer might want to do, they can do in a much simpler way with other techniques.

Comment by @cynthia May 21, 2019 (See Github)

@kenchris and I discussed during the Iceland F2F.

As we see it, the fact on whether or not we force inheritance in this particular case is not a significant blocker and we'd like to see this feature move forward. Thanks for bringing this to our attention and bearing with us for the extremely long discussion.