WHATWG on 2021-08-04

00:12	<sideshowbarker>	https://stackoverflow.com/questions/68641231/does-javascripts-abortable-fetch-close-the-http-connection
00:29	<favonia>	"https://cool.asi℀.evil.com is just an invalid URL https://jsdom.github.io/whatwg-url/#url=aHR0cHM6Ly9jb29sLmFzaeKEgC5ldmlsLmNvbQ==&base=YWJvdXQ6Ymxhbms=" Thank you. Now I see that the Step 7 you mentioned earlier implements a weaker version of the STD3 check that stopped my attack. However, I found another issue showing that whatwg-url probably violated (at least the spirit of) UTS#46. UTS#46 says: "... U+2260 ( ≠ ) NOT EQUAL TO; U+226E ( ≮ ) NOT LESS-THAN; U+226F ( ≯ ) NOT GREATER-THAN ... If an implementation uses `UseSTD3ASCIIRules=false` but disallows any of these three ASCII characters, then it must also disallow the corresponding precomposed character for its negation." The URL standard forbids `<` and `>`, so I feel `≮` and `≯` should be banned as well. I am happy to open a GitHub issue on this (smaller issue). https://jsdom.github.io/whatwg-url/#url=d3M6Ly88&base=YWJvdXQ6Ymxhbms= https://jsdom.github.io/whatwg-url/#url=d3M6Ly/iia4=&base=YWJvdXQ6Ymxhbms=
00:36	<Domenic>	Yes, I think this is a case where conforming to the rules doesn't really buy anything. Encoding ≮ to punycode seems fine.
00:37	<Domenic>	It might be clearer if you think of the URL Standard as a standalone document that gives the full processing model. The fact that it calls into some specific Unicode algorithms with some parameters is interesting, but is just an implementation detail and isn't meant to indicate any greater alignment with the philosophies of those documents.
00:42	<favonia>	alright I will skip the reporting. it's perhaps an interesting technical point, though
00:46	<favonia>	Sorry I accidentally pressed Enter to create an issue when doing some complex editing on GitHub. Please give me some time to fix that :-/
01:04	<favonia>	Domenic: Is https://jsdom.github.io/whatwg-url some website I can/should cite in my reporting? The tool is very convenient and I wonder if it's "permanent" in any sense.
01:39	<Domenic>	favonia: yes, feel free to use that site.
02:29	<favonia>	"favonia: yes, feel free to use that site." Done! https://github.com/whatwg/url/issues/626
14:25	<favonia>	After checking the Unicode tables more carefully, I must disagree with this judgment and admit UTS#46 has done the right thing. My latest reporting was mainly about NFKC and NFKD, but if we agree that the standard should prevent problematic characters on that basis, then ≮, ≯ or even ≠ could also generate <, > or = under NFD and should be banned as well.
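[Editor's sketch] The decompositions favonia refers to can be inspected with Python's standard `unicodedata` module. This is only an illustration of what NFD does to these three characters; the helper name `nfd_codepoints` is made up for this note and nothing here is part of the URL Standard or whatwg-url.

```python
import unicodedata

def nfd_codepoints(ch):
    """Return the code points of NFD(ch) as 'U+XXXX' strings (hypothetical helper)."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", ch)]

# Each precomposed "negated" symbol decomposes into an ASCII character
# followed by U+0338 COMBINING LONG SOLIDUS OVERLAY.
for ch in "\u2260\u226e\u226f":  # ≠ ≮ ≯
    print(ch, nfd_codepoints(ch))
```

Under NFD the ASCII `=`, `<`, or `>` appears as a separate code point, which is the basis of the concern raised above.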
14:34	<annevk>	That reading is only correct under the assumption that NFD(URL string) is a valid operation, which it's not; as long as you do URL parser(URL string) I don't think you have demonstrated an issue
14:35	<annevk>	I could see banning more code points out of caution (though we cannot ban all, e.g., `_` is important), but I wouldn't classify these as a problem with the URL parser
14:39	<favonia>	well... I did not imply that the current URL parser itself is wrong. I was only proposing to restrict valid URLs as you suggested. it would be kind of you to cite exact phrases which gave you such an impression so that I can revise my proposal.
14:42	<favonia>	Also, you seemed to assume no normalization should be applied to URL strings. That's against (at least the spirit of) the W3C recommendations. NFC should be applied to URL strings as well. There is a related page made by W3C Internationalization Activity: https://www.w3.org/International/questions/qa-html-css-normalization
14:51	<annevk>	favonia: if you apply normalization at that level though, there is nothing the URL parser can do about it, because you cannot distinguish it from a URL that contains ASCII `#`
14:52	<favonia>	I want to repeat that I never implied that the URL parsing is at fault. Could you possibly cite the phrases that gave you such an impression? It seems we miscommunicated and I want to clear up the misunderstanding.
14:53	<annevk>	favonia: it might be worth raising with www-international@w3.org as it's a somewhat interesting case; you cannot take a URL from somewhere, validate its scheme and host, then put the input string in HTML that gets normalized; you'd have to put the serialization in which might not be something folks realize
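[Editor's sketch] The scenario annevk describes (validate the parsed URL, then emit the original input string through something that normalizes it) can be illustrated roughly as follows. Assumptions: the downstream step applies NFKC, the example URL is invented, and Python's `urllib.parse.urlsplit` stands in for a URL parser. It is not the WHATWG URL parser and differs from it in many details.

```python
import unicodedata
from urllib.parse import urlsplit

# Hypothetical input containing U+FF03 FULLWIDTH NUMBER SIGN in the path.
raw = "https://example.com/a\uff03b"

# What a later NFKC pass (an assumption for this sketch) would turn it into.
normalized = unicodedata.normalize("NFKC", raw)

before = urlsplit(raw)         # the string that was validated
after = urlsplit(normalized)   # the string that actually gets used

# Scheme and host are identical, so a scheme/host check passes for both,
# yet the path and fragment differ after normalization.
print(before.hostname, repr(before.path), repr(before.fragment))
print(after.hostname, repr(after.path), repr(after.fragment))
```

Outputting the parser's serialization instead of the raw input string avoids relying on how the input behaves under later normalization.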
14:54	<annevk>	favonia: you start out with talking about URL records, and URL records are the result of parsing
14:54	<favonia>	yes, but the parser is not the problem. at least not in my opinion.
14:55	<annevk>	"A proper fix would probably be similar to the sanitization of host names." Isn't that a parser change?
14:55	<favonia>	"I could see banning more code points out of caution (though we cannot ban all, e.g., `_` is important), but I wouldn't classify these as a problem with the URL parser" no, this is what I meant. thank you for citing exact phrases so that I can prevent other people from misunderstanding the proposal
14:57	<annevk>	To be clear, that would be a change to the URL parser aimed at helping scenarios where people parse a URL string to validate it and then somehow output NFX(URL string) elsewhere, which is a somewhat problematic practice for various reasons
14:58	<annevk>	E.g., the changes to the URL's path or query are not something we could prevent in that way
14:58	<annevk>	(I gotta go for a bit)
15:02	<favonia>	"To be clear, that would be a change to the URL parser aimed at helping scenarios where people parse a URL string to validate it and then somehow output NFX(URL string) elsewhere, which is a somewhat problematic practice for various reasons" technically yes, but I think there's a difference between only enlarging the set of forbidden/disrecommended characters and changing the structure of the parser
15:06	<annevk>	A difference in what sense?
15:30	<favonia>	Could you possibly elaborate more so that I can better answer it? I am happy to simply admit it's a change to the parser.
15:32	<favonia>	I don't feel how I personally classify different levels of changes matters here. If WHATWG thinks it's a major change, then it's a major change. If WHATWG thinks it's a minor change, then it's a minor change. I am happy to eliminate different usages in terminology in case it helps communication.
15:38	<annevk>	I just noticed that `＃` is already rejected when it's part of a host, but it's not rejected when used as a username or password. So STD3 Rules wouldn't matter there either way.
15:39	<annevk>	As in, `https://test＃test/` results in failure.
15:39	<favonia>	that's correct. proper checking has been done for host names in the current standard.
15:40	<annevk>	So yeah, I don't think this is something that can be changed. Those components are expected to allow arbitrary scalar values.
15:41	<annevk>	It might be worth calling out somewhere though and I also think raising this with www-international could be worthwhile.
15:42	<favonia>	as a disclaimer I only started reading these documents like 2-3 days ago. as a naive suggestion how about demanding percent-encoding?
15:47	<annevk>	favonia: it would be a breaking change, URL strings have allowed U+FF03 (＃) for well over a decade
15:48	<annevk>	It's easier to ban certain things in hosts because they don't resolve anyway, but we cannot do that for paths and such
15:54	<favonia>	got it. well, I feel no one is against a warning about normalization forms other than NFC, then? I can actually be satisfied with just that. following your comment, I guess banning ≮, ≯ in host names would be fine? it has almost zero practical impact while saving us from some terrible situations due to mishandling of host names.
15:57	<annevk>	Well, e.g., ≮ becomes xn--gdh, and it's not clear we can make that inaccessible. And generally we forbid things after ToASCII succeeds, not before. So for that one I'm not sure. It would also depend on how we resolve various other longstanding IDNA issues.
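[Editor's sketch] The `xn--gdh` form mentioned above can be reproduced with Python's built-in `punycode` codec. This covers only the raw Punycode step plus the `xn--` prefix; it is not the full UTS#46 ToASCII operation (no mapping, normalization, or validity checks), and the variable names are invented for this note.

```python
# U+226E NOT LESS-THAN as a single host label.
label = "\u226e"

# Raw Punycode encoding of the label, with the ACE prefix added by hand.
ace = "xn--" + label.encode("punycode").decode("ascii")
print(ace)

# Decoding the part after the prefix gives the original label back.
decoded = ace[len("xn--"):].encode("ascii").decode("punycode")
print(decoded == label)
```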
15:58	<annevk>	If you apply NFD or some such to a domain name and then pass it to a host parser you are already likely to end up on the wrong website so it's not clear this would prevent all of these attacks so it might be better if sites address the root cause.
16:00	<annevk>	Heck, if you apply NFD to HTML in general you would open yourself up to all kinds of attacks.
16:00	<annevk>	There's a reason you want NFC unless you do some kind of specialized text processing.
16:01	<annevk>	You might enjoy http://www.diveintomark.link/2004/unicode-normalization-form-c
16:02	<favonia>	Hah, it's funny and to the point. 😆
16:03	<favonia>	to be fair KD would be even worse
16:11	<favonia>	"If you apply NFD or some such to a domain name and then pass it to a host parser you are already likely to end up on the wrong website so it's not clear this would prevent all of these attacks so it might be better if sites address the root cause." No, you will be fine even after NFD unless the application has serious bugs. See https://unicode.org/reports/tr46/#ProcessingStepNormalize. You need to compute NFC which would undo the "damage". At least Firefox got this correct.
16:20	<favonia>	`whatwg-url` gives the correct result as well (which is not really surprising because the `tr46` package used by `whatwg-url` correctly implements UTS46) https://jsdom.github.io/whatwg-url/#url=aHR0cHM6Ly88zLg=&base=YWJvdXQ6Ymxhbms=
16:24	<favonia>	anyways, personally I am much less motivated to promote the banning of ≮ and ≯ in host names because it seems significantly harder to construct concrete attacks. it's just a possibility.
16:25	<annevk>	Ah right, I guess you would need to apply one of the NFKx variants as those cannot be reversed if I remember correctly
16:25	<annevk>	It's been a long time since I looked at this in detail
16:28	<favonia>	as far as I have read (did I say I started the reading only days ago?), that seems to be the case. to be more precise, NFX is in general irreversible, but we only care about whether NFC(NFX(input)) = NFC(input) for some X here.
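[Editor's sketch] The condition favonia states, NFC(NFX(input)) = NFC(input), can be checked directly with Python's `unicodedata`. The helper name `survives_nfc` and the sample characters are chosen for this note; the sketch tests only the normalization identity and says nothing about how any particular host parser behaves.

```python
import unicodedata

def survives_nfc(s, form):
    """True if applying `form` first and then NFC gives the same result as NFC alone."""
    once = unicodedata.normalize("NFC", s)
    twice = unicodedata.normalize("NFC", unicodedata.normalize(form, s))
    return once == twice

samples = {
    "NOT LESS-THAN": "\u226e",          # ≮
    "FULLWIDTH NUMBER SIGN": "\uff03",  # ＃
    "ACCOUNT OF": "\u2100",             # ℀
}

for name, ch in samples.items():
    results = {form: survives_nfc(ch, form) for form in ("NFD", "NFKD", "NFKC")}
    print(name, results)
```

Characters with only a canonical decomposition, such as ≮, are restored by the later NFC step, while characters with a compatibility decomposition, such as ＃ and ℀, are not restored once an NFKx form has been applied.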
23:14	<favonia>	annevk Domenic Thank you for your guidance on GitHub. However, I think the issue is going nowhere: I did not sense any positive support even for the suggestion to put some warnings in the standard, and I am not willing to put more effort into convincing members of the WHATWG community. Is it okay for me to simply close the issue?