#whatwg on 2015-09-26

10:00	<qq[IrcCity]>	hello. where can I find people who care about bugs in processing “text/plain” by major modern browsers? more details here http://www.superstructure.info/browser/compromised/toxic-sniffing.html (the story is only partially written, but enough to see bugs).
10:07	<annevk>	qq[IrcCity]: "bugs"
10:07	<annevk>	qq[IrcCity]: see https://mimesniff.spec.whatwg.org/
10:09	<annevk>	qq[IrcCity]: if your problem is with BOM being more important than other encoding declarations, that's https://encoding.spec.whatwg.org/
10:09	<annevk>	qq[IrcCity]: also not considered a bug
10:15	<qq[IrcCity]>	annevk: a text wall too high; didn’t find anything relevant to my case except a mention of https://mimesniff.spec.whatwg.org/#no-sniff-flag but without much detail.
10:15	<qq[IrcCity]>	could you give a link to provisions that are relevant to “text/plain”?
10:16	<annevk>	qq[IrcCity]: it's very much unclear what "processing bug" you're talking about so I don't know
10:16	<annevk>	qq[IrcCity]: if it's indeed mostly about encodings, I recommend reading the Encoding Standard
10:16	<annevk>	qq[IrcCity]: you shouldn't be using non-utf-8 encodings anyway
10:16	<qq[IrcCity]>	damn. who approved it?
10:16	<annevk>	qq[IrcCity]: and BOM trumps Content-Type
10:17	<annevk>	qq[IrcCity]: not sure what you mean
10:17	<qq[IrcCity]>	which standard-making body declared “you shouldn't be using non-utf-8 encodings anyway”?
10:18	<annevk>	qq[IrcCity]: WHATWG, and I guess W3C did too since they copied it over
10:20	<qq[IrcCity]>	annevk: W3C standard of HTML5 doesn’t contain such rubbish. it only specified that if the document is seemingly Unicode (isn’t encoded in an octet-oriented code page), then BOM takes precedence.
10:21	<annevk>	qq[IrcCity]: while out of date and a poor reference, http://www.w3.org/TR/encoding/ most certainly has the same requirement as https://encoding.spec.whatwg.org/ with regards to utf-8
10:22	<annevk>	qq[IrcCity]: also, http://www.w3.org/TR/html5/references.html#refsENCODING
10:22	<annevk>	qq[IrcCity]: anyway, I wouldn't recommend reading W3C copies, they're not what's being implemented
10:24	<qq[IrcCity]>	http://www.w3.org/TR/encoding/ doesn’t state anything is deprecated, restricted to certain circumstances, or so. all encodings are, theoretically, permitted.
10:25	<annevk>	qq[IrcCity]: sigh
10:25	<annevk>	qq[IrcCity]: 'Authors must use the utf-8 encoding and must use the ASCII case-insensitive "utf-8" label to identify it.'
10:26	<qq[IrcCity]>	again, HTML5 is about “text/html”. annevk, do you understand the word “text/plain”?
10:26	<annevk>	qq[IrcCity]: HTML defines how text/plain works
10:26	<annevk>	qq[IrcCity]: and text/plain is not a word
10:27	<qq[IrcCity]>	RoTFL. these are different media types.
10:27	<annevk>	qq[IrcCity]: sure, but it's rendered using the HTML parser, see "Page load processing model for text files"
10:28	<annevk>	qq[IrcCity]: in the HTML standard
10:28	<qq[IrcCity]>	annevk, I read in HTML5 about parser.
10:28	<annevk>	qq[IrcCity]: irrespective of that, the Encoding standard applies to all MIME types
10:28	<qq[IrcCity]>	it’s a stage after decoding, not before.
10:29	<annevk>	qq[IrcCity]: actually, part of the HTML parser handles decoding
10:29	<qq[IrcCity]>	HTML decoding rules apply only to HTML.
10:30	<qq[IrcCity]>	and HTML5 makes a special provision about “irrelevant” confidence.
10:30	<qq[IrcCity]>	when browser operates in its internal text encoding.
10:31	<annevk>	I see, well, have a nice day
10:47	<qq[IrcCity]>	annevk, I think W3C people missed your out-of-context “Authors must use the utf-8” thing in CR-encoding-20140916, and possibly will be amazed by such “novel” ideas of yours as applying BOM sniffing to all media types including binaries ☺
10:47	<qq[IrcCity]>	may I quote this chat in mailing lists?
11:32	<qq[IrcCity]>	given the channel has logs publicly readable at krijnhoetmer.nl, I’ll proceed to quote the conversation without explicit permission.
11:36	<annevk>	Hah, you're in for a surprise ;-)
11:37	<annevk>	But yeah, feel free to quote. I tried to help you out, but it seems like you had some set of answers already
11:44	<qq[IrcCity]>	sure, I’m conservative. my “set of answers” is based on original HTTP/1.1, not diluted by accommodation to idiocy.
11:44	<annevk>	qq[IrcCity]: please stay civil
11:46	<qq[IrcCity]>	naïve people here. admins of httpd software were told for 17 years (since RFC 2068 till HTML5) they must specify Content-Type: text/whichever; charset=actual. many of them ignored it.
11:47	<qq[IrcCity]>	now you make BOM sniffing mandatory and hope it will be a universal solution ☺
11:50	<annevk>	sounds about right
11:52	<qq[IrcCity]>	and naïve Prof. Dürst who boasts he pioneered reliability of heuristic UTF-8 detection as early as in 1997, but is seemingly unaware how modern browsers “detect” UTF-8 :D
11:56	<qq[IrcCity]>	I would like to see what Martin J. Dürst would do when he eventually learned about the algorithm promulgated by Anne van Kesteren and company, and what he would say to Anne.
11:57	<annevk>	Pretty sure he knows about it
11:58	<qq[IrcCity]>	if the stream starts with «\357\273\277», then it’s damn UTF-8, even if the fourth byte is \377 :D
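
Editorial sketch of the precedence being argued about here: a byte order mark at the start of the stream wins over the charset declared in Content-Type, and nothing after the BOM is inspected. The function name `sniffEncoding` and the parameter `declaredCharset` are hypothetical, made up for illustration; this is a simplified reading of the BOM check, not any browser's actual code.

```javascript
// Hypothetical helper: pick an encoding label for a byte stream.
// A BOM, when present, overrides the charset declared in Content-Type.
function sniffEncoding(bytes, declaredCharset) {
  // UTF-8 BOM: EF BB BF (octal \357\273\277, as written in the chat)
  if (bytes.length >= 3 && bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF) {
    return "utf-8";
  }
  // UTF-16 big-endian BOM: FE FF
  if (bytes.length >= 2 && bytes[0] === 0xFE && bytes[1] === 0xFF) {
    return "utf-16be";
  }
  // UTF-16 little-endian BOM: FF FE
  if (bytes.length >= 2 && bytes[0] === 0xFF && bytes[1] === 0xFE) {
    return "utf-16le";
  }
  // No BOM: fall back to what the server declared.
  return declaredCharset.toLowerCase();
}

// The case from the chat: declared windows-1251, fourth octet \377 (0xFF).
const stream = new Uint8Array([0xEF, 0xBB, 0xBF, 0xFF]);
console.log(sniffEncoding(stream, "windows-1251"));
```

Only the first two or three octets are examined, which is why the fourth octet being invalid as UTF-8 makes no difference to the decision.
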
12:18	<qq[IrcCity]>	one big problem: major browsers ceased to honour values in Content-Type, and not only for text/html. and two narrower cases: 1. broken text/plain (already),
12:18	<qq[IrcCity]>	2. such novelties as “the Encoding standard applies to all MIME types” (yet to happen to application/* and so on).
12:34	<nox>	How is text/plain broken?
12:36	<qq[IrcCity]>	nox, did you see test cases at http://www.superstructure.info/browser/compromised/toxic-sniffing.html#better ?
12:36	<qq[IrcCity]>	one should enter \357\273\277 and so on manually, of course.
12:39	<nox>	I don't understand.
12:39	<nox>	What spec does that test follow?
12:42	<qq[IrcCity]>	nox, testing compliance with RFC 2616 was initially in mind. RFC 7231 is somewhat vague about Content-Type, but major browsers defy even its relaxed provisions.
12:42	<nox>	Where is the accommodation to idiocy?
12:44	<qq[IrcCity]>	nox: overriding explicitly served «charset=» with own guesses. BTW, what is no-sniff-flag?
12:48	<qq[IrcCity]>	in other words, Google Chrome tells its user that one lamer qq[IrcCity] got «charset=» wrong and that it (Google) knows better what the intended meaning was.
12:49	<qq[IrcCity]>	even when the text it claims to be UTF-8 has \377 in the fourth octet.
12:49	<nox>	What is that text?
12:49	<nox>	The page is lacking a lot of information. What was the actual charset you intended to transmit?
12:51	<qq[IrcCity]>	nox, there is a simpler test case at http://course.irccity.ru/ya-yu-9-amp.txt but about toxic UTF-16. two minutes, about to make the same for UTF-8.
12:53	<nox>	qq[IrcCity]: And why would \377 as the fourth octet change things btw?
12:53	<qq[IrcCity]>	from the point of view of RFC 2616, it’s at all irrelevant.
12:54	<qq[IrcCity]>	but it can show madness of BOM sniffing better.
12:55	<nox>	I see.
12:56	<nox>	Interestingly, forcing charset to Windows-1251 on that last link doesn't change anything in Safari.
12:57	<qq[IrcCity]>	http://course.irccity.ru/p-guillemet-yi-ya.txt shows toxic UTF-8.
12:57	<nox>	qq[IrcCity]: Then again, the question is whether your examples are the majority, or whether actually honouring 'charset' breaks more things.
12:58	<qq[IrcCity]>	examples are not about majority. they are about predictability.
12:59	<nox>	The rules you call bugs are predictable.
13:01	<qq[IrcCity]>	they are predictable only since some guys founded a certain WhatWG and developers agreed to follow its recommendations. they aren’t predictable in the old world where protocols did matter.
13:02	<nox>	You do realise that the specs are how they are because actually the majority never cared about honouring 'charset', even in the old world?
13:02	<qq[IrcCity]>	nox, which majority? Russian-speaking sites mostly cared.
13:03	<qq[IrcCity]>	and nobody can predict what new things WhatWG will invent tomorrow: UTF-8 sniffing for octet-stream or whatever.
13:03	<nox>	Dürst in a mail linked on the page says "Yes, the iso-8859-1 'default' was invalidated because there were millions and millions of documents for which it would have been wrong, especially in Eastern Europe and Asia."
13:03	<nox>	Are you talking about something else?
13:03	<qq[IrcCity]>	nox, about overriding explicitly served charset, again.
13:04	<qq[IrcCity]>	not old HTTP default.
13:07	<nox>	So I guess this actually broke a Russian-speaking site somewhere?
13:07	<qq[IrcCity]>	what??
13:07	<nox>	Not honouring Content-Type.
13:08	<qq[IrcCity]>	what are you speaking about? 90%+ pages contained charset= if not in Content-Type, then in HTML <meta> in the worst case.
13:10	<qq[IrcCity]>	if I am barred from saying п»їя in Windows-1251 (although all four characters pertain to the codepage), then the protocol is not honoured anymore. not a problem with HTML that may not start…
13:10	<qq[IrcCity]>	… from arbitrary characters anyway, but a problem for text/plain.
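
Editorial sketch of the collision described above: the four octets that spell п»їя in windows-1251 are exactly a UTF-8 BOM followed by 0xFF. The lookup table below is a hand-written excerpt covering only these four octets (an assumption made so the example does not depend on the runtime shipping a windows-1251 decoder); `decodeWithExcerpt` is a hypothetical helper name.

```javascript
// The octets from the chat, written there in octal as \357\273\277\377.
const octets = new Uint8Array([0xEF, 0xBB, 0xBF, 0xFF]);

// Excerpt of the windows-1251 code page, limited to the four octets above.
const WINDOWS_1251_EXCERPT = new Map([
  [0xEF, "\u043F"], // п
  [0xBB, "\u00BB"], // »
  [0xBF, "\u0457"], // ї
  [0xFF, "\u044F"], // я
]);

// Hypothetical helper: decode using only the excerpt; unknown octets become "?".
function decodeWithExcerpt(bytes) {
  return Array.from(bytes, (b) => WINDOWS_1251_EXCERPT.get(b) ?? "?").join("");
}

// What the sender declared: four Cyrillic/punctuation characters.
const asDeclared = decodeWithExcerpt(octets);

// What a UTF-8 decoder produces: the BOM is stripped and the lone 0xFF,
// which is never valid in UTF-8, becomes U+FFFD REPLACEMENT CHARACTER.
const asSniffed = new TextDecoder("utf-8").decode(octets);

console.log(JSON.stringify(asDeclared), JSON.stringify(asSniffed));
```

As nox notes a couple of lines further down, the collision only bites when these octets open the document; the same sequence later in the body is unaffected.
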
13:12	<nox>	My question is, is that an actual bug that actually breaks stuff, or is that just a theoretical problem that doesn't break anything in practice?
13:12	<nox>	You are not barred from saying such a thing in a text/plain document, you are barred from beginning a document with such a thing.
13:12	<nox>	No?
13:14	<nox>	Did you look at RFC5987 btw?
13:18	<nox>	Also, https://tools.ietf.org/html/draft-ietf-httpbis-p2-semantics-26#section-3.1.1.5
13:19	<qq[IrcCity]>	nox, how is RFC5987 relevant to the dispute? it is related to our case (RFC 7231) like RFC 1522 (i18n of headers) is related to RFC 1521 (i18n of bodies).
13:20	<nox>	qq[IrcCity]: Yeah disregard that one, lost myself in tabs.
13:20	<nox>	Meant to ask whether httpbis had some changes planned in that area, apparently no.
13:21	<qq[IrcCity]>	nox: so you agree that browsers defy HTTP semantics 3.1.1.5, don’t you?
13:22	<nox>	I agree, but I'm not sure it's actually important.
13:23	<qq[IrcCity]>	I’m not aware anyone complained before me about that.
13:24	<qq[IrcCity]>	I made Codepage Explorer just for the case, to show an actual application that can be broken by an unlucky combination of octets.
13:25	<nox>	It's a nice catch, but I'm afraid you'll just be told that it's not actually a problem in real world and thus nothing will be changed.
13:27	<qq[IrcCity]>	not a nice thing for me, since my trust in browsers ended abruptly. this is not the Internet I was accustomed to.
13:28	<qq[IrcCity]>	nowadays browsers lie to you twice as much as the despicable Windows lied to users in the 1990s.
16:17	<annevk>	Ugh
16:18	<annevk>	ECMAScript still requires that `var x = new ArrayBuffer(10); postMessage(x, "*", [x]); console.log(x.byteLength)` throws, but no implementation does that
16:36	<annevk>	Oh, actually, I'm mistaken
17:33	<Domenic>	Hmm I think it does
17:34	<Domenic>	Last I checked TC39 still wants to try that. And no implementers want to try that. So awesome.
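
Editorial sketch for trying the 16:18 snippet outside a browser window. It swaps `postMessage(x, "*", [x])` for `structuredClone` with a transfer list, on the assumption that the runtime provides it (Node.js 17+ and current browsers do); both detach the transferred buffer. The script records whether reading `byteLength` afterwards throws rather than asserting either outcome, since that is the point the chat leaves contested.

```javascript
// Transfer an ArrayBuffer, then read byteLength on the detached original.
const x = new ArrayBuffer(10);

// structuredClone with a transfer list detaches `x` and moves its
// contents into the returned buffer.
const moved = structuredClone(x, { transfer: [x] });

let threw = false;
let lengthAfterTransfer;
try {
  lengthAfterTransfer = x.byteLength;
} catch (e) {
  // Taken only if the engine throws on reading a detached buffer's length.
  threw = true;
}

console.log({ threw, lengthAfterTransfer, movedLength: moved.byteLength });
```

In the engines I am aware of, the read does not throw and the detached buffer reports a length of 0, which matches the "no implementation does that" remark.
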
23:27	<MikeSmith>	I don't understand how it's possible that I never knew about http://devdocs.io/ before now
23:27	<MikeSmith>	do other people here know about it already?
23:28	<MikeSmith>	it seems extremely well done, as far as putting some very good UI/UX around aggregated docs from a bunch of different sources (e.g., MDN, but a ton of other stuff as well)
23:40	<jgraham>	Oh, I had heard of that but now it has Rust docs
23:40	<jgraham>	Seems like it could be more convenient than trying to remember where it installs them and start a web server