02:18 | <zero-one> | this might be a dumb question, but at step 4 of https://url.spec.whatwg.org/#host-parsing , what exactly is meant by "UTF-8 decode"? |
02:18 | <zero-one> | decoded into... UTF-32? i.e., the codepoint value? |
02:19 | <zero-one> | the link to the encoding spec doesn't really clear it up for me |
02:36 | <Domenic> | It means "run the algorithm given at that link" |
02:36 | <Jeremy Roman> | It's not super explicit; the algorithm at that link yields an I/O queue of scalar values |
02:37 | <zero-one> | yeah, but the algorithm just says that UTF-8 has a decoder, it doesn't explain what that means |
02:37 | <zero-one> | and to run it |
02:37 | <Domenic> | Well, if you try to implement it, I wonder where you get stuck |
02:37 | <Domenic> | You need to keep following the links |
02:38 | <Domenic> | https://encoding.spec.whatwg.org/#concept-encoding-run , https://encoding.spec.whatwg.org/#concept-encoding-process , etc. |
02:38 | <Jeremy Roman> | at no point does it ever actually construct a string though |
02:38 | <Jeremy Roman> | (that I could find) |
02:38 | <Domenic> | Eventually you end up running the UTF-8 decoder's handler. That is defined at https://encoding.spec.whatwg.org/#ref-for-handler%E2%91%A3 . |
02:40 | <Jeremy Roman> | process an item pushes said code points into the I/O queue output, and process a queue does so until it fails or reaches the end of the input queue |
02:41 | <Jeremy Roman> | but UTF-8 decode simply returns that I/O queue of scalar values (code points) |
02:41 | <Jeremy Roman> | whereas infra describes a string as consisting of code units and then handwaves at ECMA-262 to say that if you have code units you can think of them as code points |
02:43 | <Jeremy Roman> | even though afaict the linked section of ECMA-262 only actually describes how to interpret code units as code points (and not the reverse), the intent seems to be that a conversion takes place and the URL spec is operating on infra strings |
02:43 | <Jeremy Roman> | (because nothing else would be sensible) |
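[A rough sketch of what "UTF-8 decode" ends up producing, in Python. The function name is just for illustration; the BOM-stripping and U+FFFD-on-error behavior follow the Encoding spec's "UTF-8 decode" hook.]

```python
# Illustrative sketch of the Encoding spec's "UTF-8 decode": strip a leading
# BOM, then decode, emitting U+FFFD for invalid byte sequences. The result is
# a sequence of scalar values (code points), not a UTF-16 string.
def utf8_decode(data: bytes) -> list[int]:
    if data.startswith(b"\xef\xbb\xbf"):  # UTF-8 decode removes a leading BOM
        data = data[3:]
    # Python's "replace" error handler matches the spec's replacement error mode
    return [ord(c) for c in data.decode("utf-8", errors="replace")]

print(utf8_decode("é".encode("utf-8")))  # [233] (U+00E9)
print(utf8_decode(b"\xff"))              # [65533] (U+FFFD)
```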
02:43 | <Domenic> | The conversion is defined here: https://encoding.spec.whatwg.org/#from-i-o-queue-convert . I think it is intended to be explicit, but it used to be implicit, and maybe not all cases were updated... |
02:44 | <Domenic> | Or maybe https://github.com/whatwg/infra/issues/319 is saying that it's meant to be implicit, but not defined yet |
02:45 | <Andreu Botella> | Yeah, the idea was to have implicit conversions between I/O streams and strings/lists, but looking back at it I'm not sure that's a great idea anymore |
02:45 | <Andreu Botella> | That was my first relatively large spec contribution 😬 |
02:48 | <Jeremy Roman> | I think there probably should be an explicit algorithm that does the UTF-16 encode on the scalar values; just kinda saying that somehow you end up with a string that UTF-16 decodes into the same scalar values feels magical to me |
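[The explicit step Jeremy is asking for — UTF-16-encoding a sequence of scalar values into code units — is only a few lines; a hypothetical sketch:]

```python
# Hypothetical sketch of an explicit "scalar values -> 16-bit code units" step:
# scalar values below U+10000 map to one code unit; the rest become a
# surrogate pair, per the usual UTF-16 encoding form.
def to_utf16_code_units(scalar_values: list[int]) -> list[int]:
    units = []
    for cp in scalar_values:
        if cp < 0x10000:
            units.append(cp)
        else:
            cp -= 0x10000
            units.append(0xD800 + (cp >> 10))    # high (lead) surrogate
            units.append(0xDC00 + (cp & 0x3FF))  # low (trail) surrogate
    return units

print([hex(u) for u in to_utf16_code_units([0x1F600])])  # ['0xd83d', '0xde00']
```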
02:48 | <Jeremy Roman> | though dunno if it's a big deal |
02:48 | <zero-one> | as an amateur trying to implement all these specs, i wonder why we don't just assume UTF-8-encoded input for the URL parser |
02:48 | <zero-one> | given that you have to do the conversion several times anyway |
02:49 | <zero-one> | and in 2023, i have to imagine that implementations not using UTF-8 anyway are few and far between |
02:49 | <Jeremy Roman> | afaik implementations actually pass UTF-16 to the URL parser quite frequently |
02:49 | <Jeremy Roman> | because ECMAScript defines strings as having 16-bit code units |
02:50 | <Jeremy Roman> | even though that's not what's spoken on the wire |
02:51 | <Jeremy Roman> | https://infra.spec.whatwg.org/#strings |
02:52 | <zero-one> | i guess if the URL parser is tightly coupled to a js engine, it makes more sense |
02:52 | <zero-one> | but if there's an IO boundary there, then idk |
02:55 | <Jeremy Roman> | short version is the use of UTF-8 on the wire and 16-bit code units in ECMAScript means there's nothing that doesn't end up with some amount of transcoding around the place |
02:55 | <Andreu Botella> | In the case of the URL parser, the encoding argument (which is the HTML page's encoding) affects how query strings are encoded, but not the rest of the URL |
02:55 | <Jeremy Roman> | at least for valid sequences of Unicode scalar values that's lossless |
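[That losslessness claim is easy to check in Python, treating `str` as the scalar-value side; the assumption is that the input contains no lone surrogates.]

```python
# For any sequence of Unicode scalar values, transcoding
# UTF-16 -> UTF-8 -> UTF-16 round-trips exactly; lone surrogates are the
# only thing that would break this.
s = "résumé \U0001F600 \u2603"  # includes a non-BMP scalar value
wire = s.encode("utf-8")        # what's spoken on the wire
back = wire.decode("utf-8")
assert back.encode("utf-16-le").decode("utf-16-le") == s
print(len(s), len(wire))        # code points vs. bytes differ, content doesn't
```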
02:55 | <Andreu Botella> | The reason behind this is probably historical |

03:03 | <Andreu Botella> | I just remembered this ISO-2022-JP weirdness that affects query string percent-encoding: https://encoding.spec.whatwg.org/#pit-of-iso-2022-jp |
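[Andreu's point about the encoding argument can be illustrated with Python's `urllib` as a hypothetical stand-in for the URL spec's query percent-encode step (urllib is not a spec-conformant URL parser):]

```python
from urllib.parse import quote

# The query string's percent-encoding depends on the page's encoding;
# the rest of the URL is always percent-encoded as UTF-8.
print(quote("é", encoding="utf-8"))         # %C3%A9
print(quote("é", encoding="windows-1252"))  # %E9
```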
08:53 | <Ms2ger> | TabAtkins: |
14:13 | <annevk> | I think implicit conversion is still fine. Though if we wanted to avoid that we'd need a bunch of wrapper algorithms instead. Putting the burden on each caller to do the conversion seems bad. Defining the URL parser as byte sequence -> URL is an interesting idea and it definitely makes sense for implementations. Conceptually though scalar value string -> URL seems a lot cleaner. And as most formats tend to do a byte sequence -> scalar value string conversion early on it's probably also better for callers. Though again that might only be true in concept as implementations could certainly store those URL inputs as byte sequences too. |
14:26 | <annevk> | WebIDLpedia \o/ https://dontcallmedom.github.io/webidlpedia/ |