06:18 | <hsivonen> | hsivonen: you may enjoy borrowing from https://github.com/jsdom/html-encoding-sniffer/tree/master/test . Not very exhaustive I imagine but it caught a few bugs in our implementation as you can see from https://github.com/jsdom/html-encoding-sniffer/commits/master/test |
06:20 | <hsivonen> | I think for starters, I'm going to turn https://hsivonen.com/test/moz/meta/ into both WPT reftests and Gecko reftests (for non-file: and file: ) and into same-origin-framed scripted WPTs. |
09:44 | <Jake Archibald> | annevk: Why don't we add Origin to non-CORS GET requests? Do some servers assume that the presence of the Origin header means it's a CORS request and block it? |
09:51 | <annevk> | Jake Archibald: basically |
09:54 | <Jake Archibald> | ta |
11:29 | <hsivonen> | Aargh. I suspect that Chrome doesn't limit the after-head meta charset scan to exactly 1024 bytes. I guess I need to write a variable-offset test case to find out. |
11:30 | <hsivonen> | time to learn how PHP works these days, I guess. |
11:30 | <Ms2ger 💉💉> | Should it start or end in the 1024? 🤔 |
11:31 | <hsivonen> | That's part of the question. I thought the semantics were that the stream was cut to 1024 bytes and then the > had to be within that part for the tokenizer to emit the token, but I guess that's not the whole story. |
11:41 | <hsivonen> | Fun. Either PHP or nginx adds a charset parameter to the HTTP header if I don't supply one. Fortunately, charset=bogus should work for this purpose. |
11:42 | <hsivonen> | Clearly not optimized for writing encoding test cases. |
11:54 | <annevk> | Domenic: you didn't weigh in yet, but are you okay with merging credentialless with the added warning? |
11:55 | <annevk> | hsivonen: why resort to PHP and not wpt's Python infrastructure? |
11:56 | <hsivonen> | (And I want a test with a public URL) |
11:56 | <annevk> | hsivonen: but you can just run ./wpt serve locally to get a server, no? |
11:56 | <annevk> | Ah, okay |
12:09 | <hsivonen> | Aaand, the answer is that the < of the meta has to be within the first 1024 bytes: https://hsivonen.com/test/moz/meta/after-head-variable.php?start=1023 vs. https://hsivonen.com/test/moz/meta/after-head-variable.php?start=1024 . This makes everything more complicated. :-( |
12:12 | <Jake Archibald> | annevk: Fetch is pretty strict when it comes to the format of safelisted request headers, but it isn't strict about the format of the request body. Why's that? |
12:26 | <hsivonen> | Well, the glass half-full view is that at least the start boundary is well-defined instead of being something like "whichever network buffer contains the 1024th byte" |
12:27 | <Andreu Botella (he/they)> | Are you testing network buffers? Because 1024 is a round number |
12:27 | <hsivonen> | Good point! |
12:29 | <hsivonen> | Except if there was a boundary at 1024, then the end of the token would be in the next buffer. |
12:29 | <hsivonen> | Still, I guess I have to test this. 😠 |
12:37 | <hsivonen> | Adding a buffer boundary doesn't appear to change things: https://hsivonen.com/test/moz/meta/after-head-variable.php?start=1023&flush=1030 |
12:44 | <hsivonen> | A quick source inspection indicates that the check is on characters after Latin1 decode and not a check on network buffers, but it's not immediately obvious to me why the check ends up checking the start of the token as opposed to checking the end of the token. |
12:50 | <hsivonen> | Fun times that < can open something other than a tag, so if at 1024 the tokenizer isn't in data state, it's necessary to watch also for a comment, etc., ending. |
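(Editor's sketch, not from the log.) The variable-offset test discussed above can be approximated as follows. This is a minimal Python sketch, not the actual after-head-variable.php, whose source is not shown here. The names `build_test_document` and `meta_starts_within_limit`, the space padding, the default charset label, and the zero-based reading of `start` are all assumptions made for illustration. The check only models the observation reported at 12:09 (the `<` of the meta has to fall within the first 1024 bytes); it is not a model of Chrome's actual prescan.

```python
def build_test_document(start: int, charset: str = "windows-1251") -> bytes:
    """Build an HTML document whose after-head <meta> begins at byte offset `start`.

    Assumption: `start` is a zero-based byte offset and the padding is plain spaces.
    """
    prefix = b"<!DOCTYPE html><html><head></head><body>"
    meta = b'<meta charset="' + charset.encode("ascii") + b'">'
    if start < len(prefix):
        raise ValueError("start must not fall inside the fixed prefix")
    padding = b" " * (start - len(prefix))
    return prefix + padding + meta + b"<p>Test</p></body></html>"


def meta_starts_within_limit(document: bytes, limit: int = 1024) -> bool:
    """Return True if the '<' of the first <meta lies within the first `limit` bytes."""
    offset = document.find(b"<meta")
    return 0 <= offset < limit


if __name__ == "__main__":
    for start in (1023, 1024):
        doc = build_test_document(start)
        print(start, meta_starts_within_limit(doc))
```

Under these assumptions, a document built with `start=1023` passes the check and one built with `start=1024` does not, which mirrors the pair of URLs compared at 12:09.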
13:06 | <annevk> | Jake Archibald: mostly because restricting the headers was still feasible; but you can send pretty arbitrary bodies using <form> already so that was already considered pretty much unguarded |
13:07 | <Jake Archibald> | I guess text/plain forms allow pretty much anything |
13:07 | <Jake Archibald> | Ta! |
13:08 | <annevk> | Yeah exactly. Still not quite arbitrary bytes so maybe we should have drawn a harder line there. In the early days of CORS this didn't get as much consideration as it probably should have. |
13:15 | <Andreu Botella (he/they)> | text/plain form payloads are /^.*=.*\r\n$/ , which IMO doesn't merit a harder line |
13:19 | <annevk> | Fair, you can do a lot with ASCII. 😊 |
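(Editor's sketch, not from the log.) To illustrate the point about text/plain form bodies, here is a simplified Python sketch of that serialization: each entry becomes name, "=", value, CRLF. The function name `encode_text_plain` is hypothetical, and the sketch skips details such as newline normalization in names and values, so treat it as an approximation rather than the exact browser behavior.

```python
import re


def encode_text_plain(entries: list[tuple[str, str]]) -> str:
    """Serialize form entries as one `name=value\\r\\n` line per entry (simplified)."""
    return "".join(f"{name}={value}\r\n" for name, value in entries)


# Each serialized line has the shape described in the chat: something, "=", something, CRLF.
LINE_PATTERN = re.compile(r".*=.*\r\n")

if __name__ == "__main__":
    # A field whose name and value are chosen so the "=" lands inside a JSON string.
    body = encode_text_plain([('{"a":"', '"}')])
    print(repr(body))
    print(LINE_PATTERN.fullmatch(body) is not None)
```

The example shows why the body is "not quite arbitrary bytes": the attacker controls almost everything, but an "=" and a trailing CRLF always appear in each line.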