02:18 | <zero-one> | this might be a dumb question, but at step 4 of https://url.spec.whatwg.org/#host-parsing , what exactly is meant by "UTF-8 decode"? |
02:18 | <zero-one> | decoded into... UTF-32? i.e., the codepoint value? |
02:19 | <zero-one> | the link to the encoding spec doesn't really clear it up for me |
02:36 | <Domenic> | It means "run the algorithm given at that link" |
02:36 | <Jeremy Roman> | It's not super explicit; the algorithm at that link yields an I/O queue of scalar values |
02:37 | <zero-one> | yeah, but the algorithm just says that UTF-8 has a decoder, it doesn't explain what that means |
02:37 | <zero-one> | and to run it |
02:37 | <Domenic> | Well, if you try to implement it, I wonder where you get stuck |
02:37 | <Domenic> | You need to keep following the links |
02:38 | <Domenic> | https://encoding.spec.whatwg.org/#concept-encoding-run , https://encoding.spec.whatwg.org/#concept-encoding-process , etc. |
02:38 | <Jeremy Roman> | at no point does it ever actually construct a string though |
02:38 | <Jeremy Roman> | (that I could find) |
02:38 | <Domenic> | Eventually you end up running the UTF-8 decoder's handler. That is defined at https://encoding.spec.whatwg.org/#ref-for-handler%E2%91%A3 . |
02:40 | <Jeremy Roman> | process an item pushes said code points into the I/O queue output, and process a queue does so until it fails or reaches the end of the input queue |
02:41 | <Jeremy Roman> | but UTF-8 decode simply returns that I/O queue of scalar values (code points) |
02:41 | <Jeremy Roman> | whereas infra describes a string as consisting of code units and then handwaves at ECMA-262 to say that if you have code units you can think of them as code points |
02:43 | <Jeremy Roman> | even though afaict the linked section of ECMA-262 only actually describes how to interpret code units as code points (and not the reverse), the intent seems to be that a conversion takes place and the URL spec is operating on infra strings |
02:43 | <Jeremy Roman> | (because nothing else would be sensible) |
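[A rough sketch of what "UTF-8 decode" ends up producing, in Python. The function name is just for illustration; the BOM-stripping and U+FFFD-on-error behavior follow the Encoding spec's "UTF-8 decode" hook.]

```python
# Illustrative sketch of the Encoding spec's "UTF-8 decode": strip a leading
# BOM, then decode, emitting U+FFFD for invalid byte sequences. The result is
# a sequence of scalar values (code points), not a UTF-16 string.
def utf8_decode(data: bytes) -> list[int]:
    if data.startswith(b"\xef\xbb\xbf"):  # UTF-8 decode removes a leading BOM
        data = data[3:]
    # Python's "replace" error handler matches the spec's replacement error mode
    return [ord(c) for c in data.decode("utf-8", errors="replace")]

print(utf8_decode("é".encode("utf-8")))  # [233] (U+00E9)
print(utf8_decode(b"\xff"))              # [65533] (U+FFFD)
```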
02:43 | <Domenic> | The conversion is defined here: https://encoding.spec.whatwg.org/#from-i-o-queue-convert . I think it is intended to be explicit, but it used to be implicit, and maybe not all cases were updated... |
02:44 | <Domenic> | Or maybe https://github.com/whatwg/infra/issues/319 is saying that it's meant to be implicit, but not defined yet |
02:45 | <Andreu Botella> | Yeah, the idea was to have implicit conversions between I/O streams and strings/lists, but looking back at it I'm not sure that's a great idea anymore |
02:45 | <Andreu Botella> | That was my first relatively large spec contribution 😬 |
02:48 | <Jeremy Roman> | I think there probably should be an explicit algorithm that does the UTF-16 encode on the scalar values; just kinda saying that somehow you end up with a string that UTF-16 decodes into the same scalar values feels magical to me |
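[The explicit step Jeremy is asking for — UTF-16-encoding a sequence of scalar values into code units — is only a few lines; a hypothetical sketch:]

```python
# Hypothetical sketch of an explicit "scalar values -> 16-bit code units" step:
# scalar values below U+10000 map to one code unit; the rest become a
# surrogate pair, per the usual UTF-16 encoding form.
def to_utf16_code_units(scalar_values: list[int]) -> list[int]:
    units = []
    for cp in scalar_values:
        if cp < 0x10000:
            units.append(cp)
        else:
            cp -= 0x10000
            units.append(0xD800 + (cp >> 10))    # high (lead) surrogate
            units.append(0xDC00 + (cp & 0x3FF))  # low (trail) surrogate
    return units

print([hex(u) for u in to_utf16_code_units([0x1F600])])  # ['0xd83d', '0xde00']
```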
02:48 | <Jeremy Roman> | though dunno if it's a big deal |
02:48 | <zero-one> | as an amateur trying to implement all these specs, i wonder why we don't just assume UTF-8-encoded input for the URL parser |
02:48 | <zero-one> | given that you have to do the conversion several times anyway |
02:49 | <zero-one> | and in 2023, i have to imagine that implementations not using UTF-8 anyway are few and far between |
02:49 | <Jeremy Roman> | afaik implementations actually pass UTF-16 to the URL parser quite frequently |
02:49 | <Jeremy Roman> | because ECMAScript defines strings as having 16-bit code units |
02:50 | <Jeremy Roman> | even though that's not what's spoken on the wire |
02:51 | <Jeremy Roman> | https://infra.spec.whatwg.org/#strings |
02:52 | <zero-one> | i guess if the URL parser is tightly coupled to a js engine, it makes more sense |
02:52 | <zero-one> | but if there's an IO boundary there, then idk |
02:55 | <Jeremy Roman> | short version is the use of UTF-8 on the wire and 16-bit code units in ECMAScript means there's nothing that doesn't end up with some amount of transcoding around the place |
02:55 | <Andreu Botella> | In the case of the URL parser, the encoding argument (which is the HTML page's encoding) affects how query strings are encoded, but not the rest of the URL |
02:55 | <Jeremy Roman> | at least for valid sequences of Unicode scalar values that's lossless |
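[That losslessness claim is easy to check in Python, treating `str` as the scalar-value side; the assumption is that the input contains no lone surrogates.]

```python
# For any sequence of Unicode scalar values, transcoding
# UTF-16 -> UTF-8 -> UTF-16 round-trips exactly; lone surrogates are the
# only thing that would break this.
s = "résumé \U0001F600 \u2603"  # includes a non-BMP scalar value
wire = s.encode("utf-8")        # what's spoken on the wire
back = wire.decode("utf-8")
assert back.encode("utf-16-le").decode("utf-16-le") == s
print(len(s), len(wire))        # code points vs. bytes differ, content doesn't
```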
02:55 | <Andreu Botella> | The reason behind this is probably historical |

03:03 | <Andreu Botella> | I just remembered this ISO-2022-JP weirdness that affects query string percent-encoding: https://encoding.spec.whatwg.org/#pit-of-iso-2022-jp |
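[Andreu's point about the encoding argument can be illustrated with Python's `urllib` as a hypothetical stand-in for the URL spec's query percent-encode step (urllib is not a spec-conformant URL parser):]

```python
from urllib.parse import quote

# The query string's percent-encoding depends on the page's encoding;
# the rest of the URL is always percent-encoded as UTF-8.
print(quote("é", encoding="utf-8"))         # %C3%A9
print(quote("é", encoding="windows-1252"))  # %E9
```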
08:53 | <Ms2ger> | TabAtkins: |
14:13 | <annevk> | I think implicit conversion is still fine. Though if we wanted to avoid that we'd need a bunch of wrapper algorithms instead. Putting the burden on each caller to do the conversion seems bad. Defining the URL parser as byte sequence -> URL is an interesting idea and it definitely makes sense for implementations. Conceptually though scalar value string -> URL seems a lot cleaner. And as most formats tend to do a byte sequence -> scalar value string conversion early on it's probably also better for callers. Though again that might only be true in concept as implementations could certainly store those URL inputs as byte sequences too. |
14:26 | <annevk> | WebIDLpedia \o/ https://dontcallmedom.github.io/webidlpedia/ |