00:12 | <sideshowbarker> | https://stackoverflow.com/questions/68641231/does-javascripts-abortable-fetch-close-the-http-connection |
00:29 | <favonia> | https://cool.asi℀.evil.com is just an invalid URL https://jsdom.github.io/whatwg-url/#url=aHR0cHM6Ly9jb29sLmFzaeKEgC5ldmlsLmNvbQ==&base=YWJvdXQ6Ymxhbms= Thank you. Now I see that the Step 7 you mentioned earlier implements a weaker version of the STD3 check that stopped my attack. However, I found another issue showing that whatwg-url probably violated (at least the spirit of) UTS#46. UTS#46 says:
The URL standard forbids https://jsdom.github.io/whatwg-url/#url=d3M6Ly88&base=YWJvdXQ6Ymxhbms= |
00:36 | <Domenic> | Yes, I think this is a case where conforming to the rules doesn't really buy anything. Encoding ≮ to punycode seems fine. |
00:37 | <Domenic> | It might be clearer if you think of the URL Standard as a standalone document that gives the full processing model. The fact that it calls into some specific Unicode algorithms with some parameters is interesting, but is just an implementation detail and isn't meant to indicate any greater alignment with the philosophies of those documents. |
00:42 | <favonia> | alright I will skip the reporting. it's perhaps an interesting technical point, though |
00:46 | <favonia> | Sorry I accidentally pressed Enter to create an issue when doing some complex editing on GitHub. Please give me some time to fix that :-/ |
01:04 | <favonia> | Domenic: Is https://jsdom.github.io/whatwg-url some website I can/should cite in my reporting? The tool is very convenient and I wonder if it's "permanent" in any sense. |
01:39 | <Domenic> | favonia: yes, feel free to use that site. |
14:25 | <favonia> | After checking the Unicode tables more carefully, I must disagree with this judgment and admit UTS#46 has done the right thing. My latest reporting was mainly about NFKC and NFKD, but if we agree that the standard should prevent problematic characters on that basis, then ≮, ≯ or even ≠ could also generate <, > or = under NFD and should be banned as well. |
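The decompositions mentioned in the message above can be checked with a small sketch. It assumes Python's standard `unicodedata` module as a stand-in for whatever normalizer a site actually uses; the helper name `nfd_code_points` is made up for illustration and is not part of whatwg-url or tr46.

```python
import unicodedata


def nfd_code_points(text: str) -> list[str]:
    """Return the code points of NFD(text) in U+XXXX notation."""
    decomposed = unicodedata.normalize("NFD", text)
    return [f"U+{ord(ch):04X}" for ch in decomposed]


# U+226E (not less-than), U+226F (not greater-than), U+2260 (not equal to)
for ch in ("\u226e", "\u226f", "\u2260"):
    # Each one decomposes into an ASCII symbol followed by
    # U+0338 COMBINING LONG SOLIDUS OVERLAY.
    print(f"U+{ord(ch):04X}", "->", nfd_code_points(ch))
```

Under NFD each of these characters starts with a plain ASCII `<`, `>` or `=`, which is the situation the message describes.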
14:34 | <annevk> | That reading is only correct under the assumption that NFD(URL string) is a valid operation, which it's not; as long as you do URL parser(URL string) I don't think you have demonstrated an issue |
14:35 | <annevk> | I could see banning more code points out of caution (though we cannot ban all, e.g., _ is important), but I wouldn't classify these as a problem with the URL parser |
14:39 | <favonia> | well... I did not imply that the current URL parser itself is wrong. I was only proposing to restrict valid URLs as you suggested. it would be kind of you to cite exact phrases which gave you such an impression so that I can revise my proposal. |
14:42 | <favonia> | Also, you seemed to assume no normalization should be applied to URL strings. That's against (at least the spirit of) the W3C recommendations. NFC should be applied to URL strings as well. There is a related page made by W3C Internationalization Activity: https://www.w3.org/International/questions/qa-html-css-normalization |
14:51 | <annevk> | favonia: if you apply normalization at that level though, there is nothing the URL parser can do about it, because you cannot distinguish it from a URL that contains ASCII # |
14:52 | <favonia> | I want to repeat that I never implied that the URL parsing is at fault. Could you possibly cite the phrases that gave you such an impression? It seems we miscommunicated and I want to clear up the misunderstanding. |
14:53 | <annevk> | favonia: it might be worth raising with www-international@w3.org as it's a somewhat interesting case; you cannot take a URL from somewhere, validate its scheme and host, then put the input string in HTML that gets normalized; you'd have to put the serialization in which might not be something folks realize |
14:54 | <annevk> | favonia: you start out with talking about URL records, and URL records are the result of parsing |
14:54 | <favonia> | yes, but the parser is not the problem. at least not in my opinion. |
14:55 | <annevk> | "A proper fix would probably be similar to the sanitization of host names." Isn't that a parser change? |
14:57 | <annevk> | To be clear, that would be a change to the URL parser aimed at helping scenarios where people parse a URL string to validate it and then somehow output NFX(URL string) elsewhere, which is a somewhat problematic practice for various reasons |
14:58 | <annevk> | E.g., the changes to the URL's path or query are not something we could prevent in that way |
14:58 | <annevk> | (I gotta go for a bit) |
15:06 | <annevk> | A difference in what sense? |
15:30 | <favonia> | Could you possibly elaborate more so that I can better answer it? I am happy to simply admit it's a change to the parser. |
15:32 | <favonia> | I don't feel how I personally classify different levels of changes matters here. If WHATWG thinks it's a major change, then it's a major change. If WHATWG thinks it's a minor change, then it's a minor change. I am happy to eliminate different usages in terminology in case it helps communication. |
15:38 | <annevk> | I just noticed that # is already rejected when it's part of a host, but it's not rejected when used as a username or password. So STD3 Rules wouldn't matter there either way. |
15:39 | <annevk> | As in, https://test#test/ results in failure. |
15:39 | <favonia> | that's correct. proper checking has been done for host names in the current standard. |
15:40 | <annevk> | So yeah, I don't think this is something that can be changed. Those components are expected to allow arbitrary scalar values. |
15:41 | <annevk> | It might be worth calling out somewhere though and I also think raising this with www-international could be worthwhile. |
15:42 | <favonia> | as a disclaimer I only started reading these documents like 2-3 days ago. as a naive suggestion how about demanding percent-encoding? |
15:47 | <annevk> | favonia: it would be a breaking change, URL strings have allowed U+FF03 (#) for well over a decade |
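To illustrate why U+FF03 matters in this exchange, here is a minimal sketch, again assuming Python's standard `unicodedata` module rather than any URL library. The names `FULLWIDTH_NUMBER_SIGN` and `normalize_all` are hypothetical helpers for this example only.

```python
import unicodedata

# U+FF03 FULLWIDTH NUMBER SIGN, the character discussed above.
FULLWIDTH_NUMBER_SIGN = "\uff03"


def normalize_all(text: str) -> dict[str, str]:
    """Apply each of the four Unicode normalization forms to text."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return {form: unicodedata.normalize(form, text) for form in forms}


for form, result in normalize_all(FULLWIDTH_NUMBER_SIGN).items():
    # The canonical forms leave U+FF03 alone; the compatibility
    # forms fold it to ASCII '#', the fragment delimiter.
    print(form, [f"U+{ord(ch):04X}" for ch in result])
```

A string that is normalized with a compatibility form after it was validated can therefore gain an ASCII `#` that was not there when it was parsed.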
15:48 | <annevk> | It's easier to ban certain things in hosts because they don't resolve anyway, but we cannot do that for paths and such |
15:54 | <favonia> | got it. well, I feel no one is against a warning about normalization forms other than NFC, then? I can actually be satisfied with just that. following your comment, I guess banning ≮, ≯ in host names would be fine? it has almost zero practical impact while saving us from some terrible situations due to mishandling of host names. |
15:57 | <annevk> | Well, e.g., ≮ becomes xn--gdh, and it's not clear we can make that inaccessible. And generally we forbid things after ToASCII succeeds, not before. So for that one I'm not sure. It would also depend on how we resolve various other longstanding IDNA issues. |
15:58 | <annevk> | If you apply NFD or some such to a domain name and then pass it to a host parser you are already likely to end up on the wrong website so it's not clear this would prevent all of these attacks so it might be better if sites address the root cause. |
16:00 | <annevk> | Heck, if you apply NFD to HTML in general you would open yourself up to all kinds of attacks. |
16:00 | <annevk> | There's a reason you want NFC unless you do some kind of specialized text processing. |
16:01 | <annevk> | You might enjoy http://www.diveintomark.link/2004/unicode-normalization-form-c |
16:02 | <favonia> | Hah, it's funny and to the point. 😆 |
16:03 | <favonia> | to be fair KD would be even worse |
16:20 | <favonia> | whatwg-url gives the correct result as well (which is not really surprising because the tr46 package used by whatwg-url correctly implements UTS46) https://jsdom.github.io/whatwg-url/#url=aHR0cHM6Ly88zLg=&base=YWJvdXQ6Ymxhbms= |
16:24 | <favonia> | anyways, personally I am much less motivated to promote the banning of ≮ and ≯ in host names because it seems significantly harder to construct concrete attacks. it's just a possibility. |
16:25 | <annevk> | Ah right, I guess you would need to apply one of the NFKx variants as those cannot be reversed if I remember correctly |
16:25 | <annevk> | It's been a long time since I looked at this in detail |
16:28 | <favonia> | as far as I have read (did I say I started the reading only days ago?), that seems to be the case. to be more precise, NFX is in general irreversible, but we only care about whether NFC(NFX(input)) = NFC(input) for some X here. |
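The condition NFC(NFX(input)) = NFC(input) from the message above can be written as a short check. This is only a sketch under the assumption that Python's `unicodedata` behaves like the normalizer in question; `survives_roundtrip` is a made-up name.

```python
import unicodedata


def survives_roundtrip(text: str, form: str) -> bool:
    """Return True when NFC(form(text)) equals NFC(text)."""
    nfc = unicodedata.normalize("NFC", text)
    via_form = unicodedata.normalize("NFC", unicodedata.normalize(form, text))
    return via_form == nfc


samples = {
    "U+226E": "\u226e",  # not less-than
    "U+FF03": "\uff03",  # fullwidth number sign
    "U+2100": "\u2100",  # account of
}
for name, ch in samples.items():
    for form in ("NFD", "NFKC", "NFKD"):
        print(name, form, survives_roundtrip(ch, form))
```

In this sketch NFD is undone by a later NFC, while the NFKC and NFKD foldings of U+FF03 and U+2100 are not, which matches the point that only the compatibility forms lose information here.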
23:14 | <favonia> | annevk Domenic Thank you for your guidance on GitHub. I however think the issue is going nowhere---I did not sense any positive support even for the suggestion to put some warnings in the standard, and I am not willing to put more effort into convincing members of the WHATWG community. Is it okay for me to simply close the issue? |