| 05:12 | <MikeSmith> | annevk: at https://encoding.spec.whatwg.org/#ref-for-ascii-code-point the Encoding spec says this: |
| 05:12 | <MikeSmith> | > These are “ASCII-incompatible” encodings and other than ISO-2022-JP, UTF-16BE, and UTF-16LE, which are unfortunately required due to deployed content, they are not supported. |
| 05:13 | <MikeSmith> | ...while at https://html.spec.whatwg.org/multipage/infrastructure.html#ascii-compatible-encoding the HTML spec says this: |
| 05:13 | <MikeSmith> | > An ASCII-compatible encoding is any encoding that is not a UTF-16 encoding. [ENCODING] |
| 05:26 | <MikeSmith> | .. |
| 05:27 | <MikeSmith> | So per that HTML spec language, ISO-2022-JP is an ASCII-compatible encoding |
| 06:20 | <MikeSmith> | hmm, ISO-2022-JP is actually an ASCII-compatible encoding, isn’t it? |
| 06:21 | <MikeSmith> | so that “other than ISO-2022-JP” part of the language in the Encoding spec should be dropped, shouldn’t it? |
| 07:45 | <annevk> | MikeSmith: it’s not due to the escapes |
| 07:45 | <annevk> | MikeSmith: we should prolly harmonize that language in HTML though |
| 07:51 | <MikeSmith> | annevk: Yeah I am asking because we have some code in the validator.nu parser that does a “is ASCII-compatible” check |
| 07:51 | <MikeSmith> | but I think it’s based on old spec language |
| 07:52 | <MikeSmith> | I think the current equivalent in the HTML spec is just the explicit checks for UTF-16 |
| 07:53 | <MikeSmith> | anyway, I need to check that the code is actually doing something that current spec requires (and not something that it used to require and doesn’t now) |
| 08:40 | <annevk> | MikeSmith: HTML should just drop ASCII-compatible at this point; not sure why we kept it when we added UTF-16 encoding as a thing |
| 08:42 | <annevk> | MikeSmith: and I guess HTML's "UTF-16 encoding" could move to Encoding, but would like to land the refactoring PR for Encoding first |
| 09:22 | <MikeSmith> | annevk: refactoring PR is the “Rename Encoding's "streams" to "I/O queues"” PR? |
| 09:23 | <annevk> | MikeSmith: yeah |
| 09:23 | <MikeSmith> | k |
| 09:24 | <MikeSmith> | annevk: by the way, the specific check in the validator.nu code that I’m wondering about does this: |
| 09:24 | <annevk> | andreubotella: btw, realized that Domenic is out this week so do you want to wait or potentially do some tidying up later? |
| 09:25 | <MikeSmith> | > The encoding “foo” is not an ASCII superset and, therefore, cannot be used in an internal encoding declaration. Continuing the sniffing algorithm |
| 09:25 | <MikeSmith> | ..in the meta-scan code |
| 09:26 | <MikeSmith> | actually, it’s doing the same thing for the fully-parsed case too |
| 09:27 | <MikeSmith> | > Internal encoding declaration specified “foo”, which is not an ASCII superset. Not changing the encoding. |
| 09:29 | <MikeSmith> | since there’s earlier code that explicitly checks for UTF-16, then as far as I can see, that “not an ASCII superset. Not changing the encoding” would only get reached if the encoding is ISO-2022-JP and if ISO-2022-JP is considered to not be an ASCII superset |
| 09:31 | <MikeSmith> | ah |
| 09:31 | <MikeSmith> | that is this: |
| 09:31 | <MikeSmith> | > If the encoding that is already being used to interpret the input stream is a UTF-16 encoding, then set the confidence to certain and return. The new encoding is ignored |
| 09:32 | <MikeSmith> | ...except that the spec says to do that ignore-and-return only for UTF-16 encodings explicitly (not for “not an ASCII superset” encodings) |
| 09:34 | <annevk> | Yeah, note that the specification has seen some refactoring already |
| 09:38 | <annevk> | MikeSmith: https://github.com/whatwg/html/commit/a73180679a40fce96b8e8fb6dfa5815a9bce30eb is probably of interest |
| 09:41 | MikeSmith | looks |
| 09:41 | <MikeSmith> | annevk: ah yeah that’s it |
| 09:41 | <MikeSmith> | 2015 |
| 09:42 | <MikeSmith> | I am kind of surprised how far out of conformance the validator.nu Java code is with the spec |
| 09:43 | <MikeSmith> | I mean specifically the encodings-handling code |
| 09:44 | <MikeSmith> | since it’s used for Firefox too, I would expect that’d necessarily mean that Firefox was also way out of conformance with the spec as far as encodings handling |
| 09:45 | <annevk> | MikeSmith: is it actually non-compliant though? Only checking for UTF-16 seems correct |
| 09:46 | <MikeSmith> | that is just one place I have found where the Java code is non-conforming |
| 09:46 | <annevk> | To stress the point a bit, the Encoding Standard's definition of ASCII-incompatible is completely non-normative |
| 09:46 | <MikeSmith> | OK |
| 09:47 | annevk | wonders if the big OK represents an annoyed MikeSmith 😊 |
| 09:47 | <MikeSmith> | no, no — not annoyed at all |
| 09:48 | <MikeSmith> | anyway, another place that the Java code does not match the spec is that it implements the Charset Alias Matching thing rather than just trim-leading-trailing whitespace |
| 09:49 | <MikeSmith> | so I am kind of beginning to suspect that this is a part of the Java source that doesn’t actually get used in the Firefox code |
| 09:50 | <annevk> | oh yeah, that's bad |
| 09:50 | <annevk> | Pretty sure that's not in Firefox indeed |
| 09:50 | <annevk> | I wonder how many more times I will see Charset Alias Matching referenced in my life |
| 09:50 | <MikeSmith> | yeah, I think Henri must have separate C++ source for this |
| 09:50 | <MikeSmith> | haha |
| 09:51 | <MikeSmith> | more than you would like, I’m sure |
| 09:52 | <MikeSmith> | oh, actually I already know one specific place where the Firefox code does something very different from the Java code: the "replacement" encoding name/label |
| 09:52 | <MikeSmith> | there is zero code in the Java sources for dealing with the "replacement" encoding |
| 09:52 | <MikeSmith> | ...yet Firefox handles it per-spec |
| 09:53 | <annevk> | R.I.P. He rid the web of Charset Alias Matching. OK chap. |
| 09:57 | <MikeSmith> | hahah |
| 10:05 | <andreubotella> | annevk: oh, I didn't know that. Let's merge now and fix later, then |
| 10:07 | <annevk> | andreubotella: sounds good, doing a final round of nits now |
| 10:07 | <andreubotella> | 👍 |
| 12:08 | <noamr> | annevk: hi, I've updated https://github.com/whatwg/html/pull/5574 to account for same-origin concerns as discussed. |
| 20:28 | <EveryOS> | Today I posted to the wicg discourse 1400 words worth of the most stupid, unrealistic idea I've ever had. At least it has not been deleted, so that's a plus... |
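
A rough illustration of two points from the 07:45 to 09:46 exchange above, offered as a sketch rather than as what validator.nu or any browser actually does. The first part shows why ISO-2022-JP is not ASCII-compatible "due to the escapes": its output is entirely 7-bit, yet those bytes do not mean the same thing when read as ASCII. The second part mirrors the quoted spec step that ignores a new encoding only when the current one is a UTF-16 encoding. The helper names (`is_seven_bit`, `ascii_bytes_mean_ascii`, `keeps_current_encoding`) and the `UTF16_ENCODINGS` set are hypothetical; only the codec names are real Python codecs.

```python
# Hypothetical names throughout; this is an illustration, not the spec's algorithm.
UTF16_ENCODINGS = {"UTF-16BE", "UTF-16LE"}


def is_seven_bit(data: bytes) -> bool:
    """True if every byte is in the ASCII range 0x00-0x7F."""
    return all(b < 0x80 for b in data)


def ascii_bytes_mean_ascii(text: str, codec: str) -> bool:
    """True if the encoded bytes, read as ASCII, give back the same text."""
    data = text.encode(codec)
    return is_seven_bit(data) and data.decode("ascii") == text


def keeps_current_encoding(current_encoding: str) -> bool:
    """Mirror of the quoted step: only a UTF-16 encoding already in use
    causes the newly declared encoding to be ignored."""
    return current_encoding in UTF16_ENCODINGS


sample = "日本"
encoded = sample.encode("iso2022_jp")

# All bytes are 7-bit because the encoding switches character sets via ESC sequences.
print(is_seven_bit(encoded))
# Reading those same bytes as ASCII yields escape characters and punctuation, not the text.
print(encoded.decode("ascii") == sample)
# Under the UTF-16-only check, ISO-2022-JP gets no special treatment here.
print(keeps_current_encoding("ISO-2022-JP"))
```

Plain ASCII text round-trips unchanged through `iso2022_jp`, which is why the encoding can look ASCII-compatible until an escape sequence appears.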