10:00
<qq[IrcCity]>
hello. where can I find people who care about bugs in processing “text/plain” by major modern browsers? more details here http://www.superstructure.info/browser/compromised/toxic-sniffing.html (the story is only partially written, but enough to see bugs).
10:07
<annevk>
qq[IrcCity]: "bugs"
10:07
<annevk>
qq[IrcCity]: see https://mimesniff.spec.whatwg.org/
10:09
<annevk>
qq[IrcCity]: if your problem is with BOM being more important than other encoding declarations, that's https://encoding.spec.whatwg.org/
10:09
<annevk>
qq[IrcCity]: also not considered a bug
10:15
<qq[IrcCity]>
annevk: a text wall too high; didn’t find anything relevant to my case except one mention, https://mimesniff.spec.whatwg.org/#no-sniff-flag, but without much detail.
10:15
<qq[IrcCity]>
could you give a link to provisions that are relevant to “text/plain”?
10:16
<annevk>
qq[IrcCity]: it's very much unclear what "processing bug" you're talking about so I don't know
10:16
<annevk>
qq[IrcCity]: if it's indeed mostly about encodings, I recommend reading the Encoding Standard
10:16
<annevk>
qq[IrcCity]: you shouldn't be using non-utf-8 encodings anyway
10:16
<qq[IrcCity]>
damn. who approved it?
10:16
<annevk>
qq[IrcCity]: and BOM trumps Content-Type
10:17
<annevk>
qq[IrcCity]: not sure what you mean
10:17
<qq[IrcCity]>
which standard-making body declared “you shouldn't be using non-utf-8 encodings anyway”?
10:18
<annevk>
qq[IrcCity]: WHATWG, and I guess W3C did too since they copied it over
10:20
<qq[IrcCity]>
annevk: W3C standard of HTML5 doesn’t contain such rubbish. it only specifies that if the document is seemingly Unicode (isn’t encoded in an octet-oriented code page), then BOM takes precedence.
10:21
<annevk>
qq[IrcCity]: while out of date and a poor reference, http://www.w3.org/TR/encoding/ most certainly has the same requirement as https://encoding.spec.whatwg.org/ with regards to utf-8
10:22
<annevk>
qq[IrcCity]: also, http://www.w3.org/TR/html5/references.html#refsENCODING
10:22
<annevk>
qq[IrcCity]: anyway, I wouldn't recommend reading W3C copies, they're not what's being implemented
10:24
<qq[IrcCity]>
http://www.w3.org/TR/encoding/ doesn’t state anything is deprecated, restricted to certain circumstances, or so. all encodings are, theoretically, permitted.
10:25
<annevk>
qq[IrcCity]: *sigh*
10:25
<annevk>
qq[IrcCity]: 'Authors must use the utf-8 encoding and must use the ASCII case-insensitive "utf-8" label to identify it.'
10:26
<qq[IrcCity]>
again, HTML5 is about “text/html”. annevk, do you understand the word “text/plain”?
10:26
<annevk>
qq[IrcCity]: HTML defines how text/plain works
10:26
<annevk>
qq[IrcCity]: and text/plain is not a word
10:27
<qq[IrcCity]>
RoTFL. these are different media types.
10:27
<annevk>
qq[IrcCity]: sure, but it's rendered using the HTML parser, see "Page load processing model for text files"
10:28
<annevk>
qq[IrcCity]: in the HTML standard
10:28
<qq[IrcCity]>
annevk, I read about the parser in HTML5.
10:28
<annevk>
qq[IrcCity]: irrespective of that, the Encoding standard applies to all MIME types
10:28
<qq[IrcCity]>
it’s a stage after decoding, not before.
10:29
<annevk>
qq[IrcCity]: actually, part of the HTML parser handles decoding
10:29
<qq[IrcCity]>
HTML decoding rules apply only to HTML.
10:30
<qq[IrcCity]>
and HTML5 makes a special provision about “irrelevant” confidence.
10:30
<qq[IrcCity]>
when the browser operates in its internal text encoding.
10:31
<annevk>
I see, well, have a nice day
10:47
<qq[IrcCity]>
annevk, I think W3C people missed your out-of-context “Authors must use the utf-8” thing in CR-encoding-20140916, and possibly will be amazed by such “novel” ideas of yours as applying BOM sniffing to all media types including binaries ☺
10:47
<qq[IrcCity]>
may I quote this chat in mailing lists?
11:32
<qq[IrcCity]>
given the channel has logs publicly readable at krijnhoetmer.nl, I’ll proceed to quote the conversation without explicit permission.
11:36
<annevk>
Hah, you're in for a surprise ;-)
11:37
<annevk>
But yeah, feel free to quote. I tried to help you out, but it seems like you had some set of answers already
11:44
<qq[IrcCity]>
sure, I’m conservative. my “set of answers” is based on original HTTP/1.1, not diluted by accommodation to idiocy.
11:44
<annevk>
qq[IrcCity]: please stay civil
11:46
<qq[IrcCity]>
naïve people here. admins of httpd software were told for 17 years (from RFC 2068 until HTML5) that they must specify Content-Type: text/whichever; charset=actual. many of them ignored it.
11:47
<qq[IrcCity]>
now you make BOM sniffing mandatory and hope it will be a universal solution ☺
11:50
<annevk>
sounds about right
11:52
<qq[IrcCity]>
and naïve Prof. Dürst who boasts he pioneered reliability of heuristic UTF-8 detection as early as in 1997, but is seemingly unaware of how modern browsers “detect” UTF-8 :D
11:56
<qq[IrcCity]>
I would like to see what Martin J. Dürst does when he eventually learns about the algorithm promulgated by Anne van Kesteren and company, and what he would say to Anne.
11:57
<annevk>
Pretty sure he knows about it
11:58
<qq[IrcCity]>
if the stream starts with «\357\273\277», then it’s damn UTF-8, even if the fourth byte is \377 :D
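The precedence under discussion (a leading byte order mark overriding the charset declared in Content-Type) can be sketched roughly as below. This is a simplified illustration and not the text of the Encoding Standard; the names `sniff_bom` and `decode_body` are hypothetical, and the charset argument stands in for whatever the server declared.

```python
def sniff_bom(body: bytes):
    """Return the encoding implied by a leading byte order mark, or None."""
    if body.startswith(b"\xef\xbb\xbf"):      # octal \357\273\277
        return "utf-8", 3
    if body.startswith(b"\xfe\xff"):
        return "utf-16-be", 2
    if body.startswith(b"\xff\xfe"):
        return "utf-16-le", 2
    return None


def decode_body(body: bytes, declared_charset: str) -> str:
    """Decode a response body, letting a BOM override the declared charset."""
    sniffed = sniff_bom(body)
    if sniffed is not None:
        encoding, bom_length = sniffed
        # The declared charset is ignored; malformed input becomes U+FFFD.
        return body[bom_length:].decode(encoding, errors="replace")
    return body.decode(declared_charset, errors="replace")


# The four octets from the chat, served with charset=windows-1251.
body = b"\357\273\277\377"
as_declared = body.decode("windows-1251")         # what the server declared
as_sniffed = decode_body(body, "windows-1251")    # what BOM sniffing yields
```

Here `as_declared` is the four-character Windows-1251 reading, while `as_sniffed` is a single replacement character, because \377 is not valid UTF-8.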
12:18
<qq[IrcCity]>
one big problem: major browsers ceased to honour values in Content-Type, and not only for text/html. and two narrower cases: 1. broken text/plain (already),
12:18
<qq[IrcCity]>
2. such novelties as “the Encoding standard applies to all MIME types” (yet to happen to application/* and so on).
12:34
<nox>
How is text/plain broken?
12:36
<qq[IrcCity]>
nox, did you see test cases at http://www.superstructure.info/browser/compromised/toxic-sniffing.html#better ?
12:36
<qq[IrcCity]>
one should enter \357\273\277 and so on manually, of course.
12:39
<nox>
I don't understand.
12:39
<nox>
What spec does that test follow?
12:42
<qq[IrcCity]>
nox, testing compliance with RFC 2616 was what I initially had in mind. RFC 7231 is somewhat vague about Content-Type, but major browsers defy even its relaxed provisions.
12:42
<nox>
Where is the accommodation to idiocy?
12:44
<qq[IrcCity]>
nox: overriding an explicitly served «charset=» with their own guesses. BTW, what is no-sniff-flag?
12:48
<qq[IrcCity]>
in other words, Google Chrome tells its user that one lamer qq[IrcCity] got «charset=» wrong and that it (Google) knows better what the intended meaning was.
12:49
<qq[IrcCity]>
even when the text it claims to be UTF-8 has \377 in the fourth octet.
12:49
<nox>
What is that text?
12:49
<nox>
The page is lacking a lot of information. What was the actual charset you intended to transmit?
12:51
<qq[IrcCity]>
nox, there is a simpler test case at http://course.irccity.ru/ya-yu-9-amp.txt but about toxic UTF-16. two minutes, about to make the same for UTF-8.
12:53
<nox>
qq[IrcCity]: And why would \377 as the fourth octet change things btw?
12:53
<qq[IrcCity]>
from the point of view of RFC 2616, it’s entirely irrelevant.
12:54
<qq[IrcCity]>
but it can show the madness of BOM sniffing better.
12:55
<nox>
I see.
12:56
<nox>
Interestingly, forcing charset to Windows-1251 on that last link doesn't change anything in Safari.
12:57
<qq[IrcCity]>
http://course.irccity.ru/p-guillemet-yi-ya.txt shows toxic UTF-8.
12:57
<nox>
qq[IrcCity]: Then again, the question is whether your examples are the majority, or whether actually honouring 'charset' breaks more things.
12:58
<qq[IrcCity]>
examples are not about majority. they are about predictability.
12:59
<nox>
The rules you call bugs are predictable.
13:01
<qq[IrcCity]>
they are predictable only since some guys founded a certain WhatWG and developers agreed to follow its recommendations. they aren’t predictable in the old world where protocols did matter.
13:02
<nox>
You do realise that the specs are how they are because actually the majority never cared about honouring 'charset', even in the old world?
13:02
<qq[IrcCity]>
nox, which majority? Russian-speaking sites mostly cared.
13:03
<qq[IrcCity]>
and nobody can predict what new things WhatWG will invent tomorrow: UTF-8 sniffing for octet-stream or whatever.
13:03
<nox>
Dürst in a mail linked on the page says "Yes, the iso-8859-1 'default' was invalidated because there were millions and millions of documents for which it would have been wrong, especially in Eastern Europe and Asia."
13:03
<nox>
Are you talking about something else?
13:03
<qq[IrcCity]>
nox, about overriding an explicitly served charset, again.
13:04
<qq[IrcCity]>
not old HTTP default.
13:07
<nox>
So I guess this actually broke a Russian-speaking site somewhere?
13:07
<qq[IrcCity]>
what??
13:07
<nox>
Not honouring Content-Type.
13:08
<qq[IrcCity]>
what are you speaking about? 90%+ of pages contained charset= if not in Content-Type, then in HTML <meta> in the worst case.
13:10
<qq[IrcCity]>
if I am barred from saying п»їя in Windows-1251 (although all four characters pertain to the codepage), then the protocol is not honoured anymore. not a problem with HTML that may not start…
13:10
<qq[IrcCity]>
… from arbitrary characters anyway, but a problem for text/plain.
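The collision described here can be checked directly. The snippet below is a small sketch using Python's built-in `cp1251` codec as a stand-in for Windows-1251; the variable names are illustrative only.

```python
import codecs

# The four characters from the message above, all present in Windows-1251.
text = "п»їя"
octets = text.encode("cp1251")

# The first three octets coincide with the UTF-8 byte order mark.
starts_with_bom = octets.startswith(codecs.BOM_UTF8)

# The remaining octet, 0xFF, never occurs in well-formed UTF-8.
try:
    octets[3:].decode("utf-8")
    tail_is_valid_utf8 = True
except UnicodeDecodeError:
    tail_is_valid_utf8 = False
```

So a text/plain body that begins with these characters in Windows-1251 is byte-for-byte a UTF-8 BOM followed by an octet that cannot be valid UTF-8.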
13:12
<nox>
My question is, is that an actual bug that actually breaks stuff, or is that just a theoretical problem that doesn't break anything in practice?
13:12
<nox>
You are not barred from saying such a thing in a text/plain document, you are barred from beginning a document with such a thing.
13:12
<nox>
No?
13:14
<nox>
Did you look at RFC5987 btw?
13:18
<nox>
Also, https://tools.ietf.org/html/draft-ietf-httpbis-p2-semantics-26#section-3.1.1.5
13:19
<qq[IrcCity]>
nox, how is RFC5987 relevant to the dispute? it is related to our case (RFC 7231) like RFC 1522 (i18n of headers) is related to RFC 1521 (i18n of bodies).
13:20
<nox>
qq[IrcCity]: Yeah disregard that one, lost myself in tabs.
13:20
<nox>
Meant to ask whether httpbis had some changes planned in that area, apparently no.
13:21
<qq[IrcCity]>
nox: so you agree that browsers defy HTTP semantics 3.1.1.5, don’t you?
13:22
<nox>
I agree, but I'm not sure it's actually important.
13:23
<qq[IrcCity]>
I’m not aware of anyone complaining about that before me.
13:24
<qq[IrcCity]>
I made Codepage Explorer just for the case, to show an actual application that can be broken by an unlucky combination of octets.
13:25
<nox>
It's a nice catch, but I'm afraid you'll just be told that it's not actually a problem in real world and thus nothing will be changed.
13:27
<qq[IrcCity]>
not a nice thing for me, since my trust in browsers ended abruptly. this is not the Internet I was accustomed to.
13:28
<qq[IrcCity]>
nowadays browsers lie to you twice as much as despicable Windows lied to users in the 1990s.
16:17
<annevk>
Ugh
16:18
<annevk>
ECMAScript still requires that `var x = new ArrayBuffer(10); postMessage(x, "*", [x]); console.log(x.byteLength)` throws, but no implementation does that
16:36
<annevk>
Oh, actually, I'm mistaken
17:33
<Domenic>
Hmm I think it does
17:34
<Domenic>
Last I checked TC39 still wants to try that. And no implementers want to try that. So awesome.
23:27
<MikeSmith>
I don't understand how it's possible that I never knew about http://devdocs.io/ before now
23:27
<MikeSmith>
do other people here know about it already?
23:28
<MikeSmith>
it seems extremely well done, as far as putting some very good UI/UX around aggregated docs from a bunch of different sources (e.g., MDN, but a ton of other stuff as well)
23:40
<jgraham>
Oh, I had heard of that but now it has Rust docs
23:40
<jgraham>
Seems like it could be more convenient than trying to remember where it installs them and start a web server