13:21 | <annevk> | sideshowbarker: Domenic: if you could look at https://github.com/whatwg/dom/pull/1004 at least up until line 2100 that'd be terrific |
13:27 | <sideshowbarker> | annevk: will look now |
14:53 | <annevk> | Thanks sideshowbarker. To be clear though, I was thinking of aligning more prose with this new model if it looks good. There's actually quite a lot of text that is impacted. |
14:54 | <sideshowbarker> | annevk: so you want suggestions on what else needs to be changed? |
14:54 | <annevk> | sideshowbarker: no, just if this direction looks good as a thing to expand upon; I can find the rest :-) |
14:54 | <sideshowbarker> | OK, I think in general it’s definitely an improvement |
14:55 | <sideshowbarker> | I am as usual more interested in the use case of web developers reading the spec than I am for implementors |
14:56 | <sideshowbarker> | as far as implementors, I get the impression that the existing spec works well overall for anybody implementing/writing code from it |
14:56 | <sideshowbarker> | but I think it’s a lot harder for developers to learn from |
14:57 | <annevk> | Apparently jsdom ran into issues with "is a Text node" not clearly including CDATASection nodes |
14:57 | <sideshowbarker> | aha |
14:57 | <sideshowbarker> | OK, yeah, I wondered who would actually be implementing these parts at this point |
14:58 | <annevk> | You don't really want to say Text or CDATASection however as otherwise you also need to say Element or HTMLAnchorElement or ... It gets tricky quickly |
14:59 | <sideshowbarker> | yeah, I can see that |
14:59 | <annevk> | Anyway, I hope that depending on CharacterData in a number of cases will make that more clear, as well as more clearly explaining what X node is. |
15:00 | <sideshowbarker> | I have recently been looking at some of the low-level DOM-related content at MDN, and to speak generously… it has a lot of room for improvement |
15:00 | <sideshowbarker> | there is actually nowhere any good “Introduction to the DOM” for web developers |
15:01 | <Domenic> | You don't really want to say Text or CDATASection however as otherwise you also need to say Element or HTMLAnchorElement or ... It gets tricky quickly |
15:01 | <annevk> | Domenic: yeah that's fair, but there's nothing that corresponds to nodeType |
15:02 | <Domenic> | Yeah it's just very easy to think that "is a Text node" means .nodeType === TEXT_NODE |
15:02 | <Domenic> | I just opened the PR but if you use "implements" that will help. |
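[Editor's note] The distinction discussed above, between "is a Text node" in the sense of implementing the Text interface and a check on nodeType, can be sketched outside the browser. The example below uses Python's standard-library xml.dom.minidom as a stand-in, on the assumption that its class hierarchy mirrors the DOM's here (CDATASection inheriting from Text); it is an illustration only, not the browser DOM or jsdom. The helper names is_text_by_interface and is_text_by_node_type are made up for this sketch.

```python
from xml.dom import Node
from xml.dom.minidom import Document, Text

doc = Document()
text = doc.createTextNode("plain text")
cdata = doc.createCDATASection("raw <markup>")


def is_text_by_interface(node):
    """Hypothetical helper: true for Text and anything inheriting from it."""
    return isinstance(node, Text)


def is_text_by_node_type(node):
    """Hypothetical helper: true only when nodeType is exactly TEXT_NODE."""
    return node.nodeType == Node.TEXT_NODE


for node in (text, cdata):
    print(type(node).__name__,
          is_text_by_interface(node),
          is_text_by_node_type(node))
```

In minidom the CDATASection node passes the interface check but fails the nodeType comparison, which is the misreading Domenic describes.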
18:37 | <favonia> | Hi, I was checking the standard library of the Go programming language and found that the STD3 rules are disabled by default in the current WHATWG URL standard. I was a bit surprised because many Unicode characters in their KC normal forms could introduce forbidden host code points. TL;DR: I wish to add a warning about the danger of disabling the STD3 check and, if an application chooses to disable the STD3 check (possibly to allow the underscore), it should verify that forbidden host code points would never arise from normalization and mapping. |
20:12 | <Domenic> | favonia: I don't think I understand the issue. Because those are forbidden host code points parsing will fail if they appear. It doesn't matter for the spec whether the parsing fails due to Unicode STD3 checking in step 5 or because of the explicit step 7 in https://url.spec.whatwg.org/#concept-host-parser |
21:06 | <favonia> | As far as I understand, Step 7 does not prevent such attacks. One attack is that the same URL (record) has multiple Unicode normal forms that would be parsed differently. Here is an example:
A parser would give the following results:
But with its NFKC (where
a parser would give these results instead:
So, the string and one of its normal forms are both valid URL records, but with different structures. The discrepancies can be exploited in many IDNA-aware applications to fool users or even bypass security checking. This attack is known as HostSplit. |
21:13 | <Domenic> | That is not how those URLs are parsed |
21:13 | <Domenic> | See https://jsdom.github.io/whatwg-url/#url=aHR0cHM6Ly9nb29nbGUuY29tI0BldmlsLmNvbQ==&base=YWJvdXQ6Ymxhbms= |
21:14 | <Domenic> | and https://jsdom.github.io/whatwg-url/#url=aHR0cHM6Ly9nb29nbGUuY29t77yDQGV2aWwuY29t&base=YWJvdXQ6Ymxhbms= |
21:16 | <Domenic> | I guess your larger point remains though |
21:16 | <Domenic> | Which is yes, different input strings can produce different hosts |
21:16 | <Domenic> | Opening an issue to discuss that seems fine if you want? |
21:17 | <favonia> | Oops, sorry for my mistakes. It's hard to construct such contrived examples on the fly. 😛 Maybe https://localhost#@evil.com and https://localhost#@evil.com would work. |
21:47 | <favonia> | After some thinking I realized this is probably more serious than I thought---the STD3 rules in UTS#46 could prevent some attacks but seem powerless to handle https://localhost#@evil.com vs. https://localhost#@evil.com where the problematic character is never part of the host/domain name. 😱 |
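[Editor's note] The two URLs in the message above render identically in this log. A minimal sketch of the kind of pair being discussed, assuming the lookalike character is U+FF03 FULLWIDTH NUMBER SIGN (the character encoded in the second jsdom link Domenic posted earlier), might look like this. Python's unicodedata and urllib.parse are used purely for illustration; urllib.parse is not a WHATWG URL parser, and the variable names are made up here.

```python
import unicodedata
from urllib.parse import urlsplit

# Assumption: the lookalike is U+FF03 FULLWIDTH NUMBER SIGN.
FULLWIDTH_HASH = "\uff03"

raw = f"https://localhost{FULLWIDTH_HASH}@evil.com"
normalized = unicodedata.normalize("NFKC", raw)

# NFKC folds the fullwidth sign into an ASCII '#', so a fragment
# delimiter appears that the raw string never contained.
print("#" in raw, "#" in normalized)

# Parsing the already-normalized string: everything after '#' is a fragment.
parts = urlsplit(normalized)
print(parts.hostname, parts.fragment)
```

Any component that normalizes the whole string before parsing sees localhost as the host, while one that parses the raw string first sees no fragment delimiter at all. That matches favonia's point that the character in question is never part of the host by the time host-level checks such as STD3 would run.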
21:50 | <favonia> | I can start a GitHub issue, though I will not be able to meaningfully participate in the discussions probably after two weeks. (I'm teaching in a university and the semester is starting...) Also, I am not an expert on Unicode/URL, merely a concerned user after reading these documents. There might be many strange corner cases that I am not aware of. Therefore, perhaps someone else should take the lead? I can still take the initiative. |
21:58 | <favonia> | PS: consistently applying STD3 rules can probably detect something like https://cool.asi℀.evil.com where ℀ could be normalized to a/c . |
22:01 | <Domenic> | https://cool.asi℀.evil.com is just an invalid URL https://jsdom.github.io/whatwg-url/#url=aHR0cHM6Ly9jb29sLmFzaeKEgC5ldmlsLmNvbQ==&base=YWJvdXQ6Ymxhbms= |
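[Editor's note] The ℀ case can be checked the same way. The sketch below uses NFKC from Python's unicodedata as a rough stand-in for the UTS #46 mapping step (an assumption; the real mapping table does more than NFKC) and tests the result against a small, deliberately partial subset of the forbidden host code points. FORBIDDEN_SUBSET and the other names are hypothetical.

```python
import unicodedata

ACCOUNT_OF = "\u2100"  # the '℀' character from the example above

raw_host = f"cool.asi{ACCOUNT_OF}.evil.com"
mapped_host = unicodedata.normalize("NFKC", raw_host)

# Partial, illustrative subset of the forbidden host code points.
FORBIDDEN_SUBSET = set("#/:?@")

offending = sorted(FORBIDDEN_SUBSET & set(mapped_host))
print(mapped_host)
print(offending)
```

Here the mapped host contains a "/", so a check that runs after mapping rejects it, consistent with Domenic's observation that the URL is simply invalid.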