13:21
<annevk>
sideshowbarker: Domenic: if you could look at https://github.com/whatwg/dom/pull/1004 at least up until line 2100 that'd be terrific
13:27
<sideshowbarker>
annevk: will look now
14:53
<annevk>
Thanks sideshowbarker. To be clear though, I was thinking of aligning more prose with this new model if it looks good. There's actually quite a lot of text that is impacted.
14:54
<sideshowbarker>
annevk: so you want suggestions on what else needs to be changed?
14:54
<annevk>
sideshowbarker: no, just if this direction looks good as a thing to expand upon; I can find the rest :-)
14:54
<sideshowbarker>
OK, I think in general it’s definitely an improvement
14:55
<sideshowbarker>
I am as usual more interested in the use case of web developers reading the spec than in that of implementors
14:56
<sideshowbarker>
as far as implementors go, I get the impression that the existing spec works well overall for anybody implementing/writing code from it
14:56
<sideshowbarker>
but I think it’s a lot harder for developers to learn from
14:57
<annevk>
Apparently jsdom ran into issues with "is a Text node" not clearly including CDATASection nodes
14:57
<sideshowbarker>
aha
14:57
<sideshowbarker>
OK, yeah, I wondered who would actually be implementing these parts at this point
14:58
<annevk>
You don't really want to say Text or CDATASection however as otherwise you also need to say Element or HTMLAnchorElement or ... It gets tricky quickly
14:59
<sideshowbarker>
yeah, I can see that
14:59
<annevk>
Anyway, I hope that depending on CharacterData in a number of cases will make that more clear, as well as more clearly explaining what X node is.
15:00
<sideshowbarker>
I have recently been looking at some of the lower-level DOM-related content at MDN, and to speak generously… it has a lot of room for improvement
15:00
<sideshowbarker>
there is actually nowhere any good “Introduction to the DOM” for web developers
15:01
<Domenic>
> You don't really want to say Text or CDATASection however as otherwise you also need to say Element or HTMLAnchorElement or ... It gets tricky quickly
The difference is that when working in "DOM" code as opposed to "HTML" code, it's pretty natural to use .nodeType a lot.
15:01
<annevk>
Domenic: yeah that's fair, but there's nothing that corresponds to nodeType
15:02
<Domenic>
Yeah it's just very easy to think that "is a Text node" means .nodeType === TEXT_NODE
15:02
<Domenic>
I just opened the PR but if you use "implements" that will help.
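The gap between "is a Text node" read as a `nodeType` check and read as an interface check can be sketched in a few lines. This is only an illustration: it uses Python's `xml.dom.minidom`, which is not an implementation of the WHATWG DOM Standard but happens to mirror the same inheritance (its `CDATASection` class derives from `Text`). The helper names `is_text_by_node_type` and `is_text_by_interface` are hypothetical.

```python
from xml.dom import Node, minidom

doc = minidom.Document()
text = doc.createTextNode("plain")
cdata = doc.createCDATASection("raw")


def is_text_by_node_type(node) -> bool:
    # The tempting reading: "is a Text node" means nodeType === TEXT_NODE.
    return node.nodeType == Node.TEXT_NODE


def is_text_by_interface(node) -> bool:
    # The "implements" reading: anything that is a Text, including subclasses.
    return isinstance(node, minidom.Text)


for node in (text, cdata):
    print(type(node).__name__, is_text_by_node_type(node), is_text_by_interface(node))
```

Under the `nodeType` reading the CDATASection node is excluded; under the interface reading it is included, which is the distinction the "implements" wording is meant to make explicit.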
18:37
<favonia>

Hi, I was checking the standard library of the Go programming language and found that the STD3 rules are disabled by default in the current WHATWG URL standard. I was a bit surprised because the NFKC normal forms of many Unicode characters could introduce forbidden host code points (@, #, /, ...) that are dangerous or at least misleading. The STD3 check protects us from those dangerous characters. I understand that many existing hosts have the underscore _ in their names, which STD3 forbids, but by disabling the entire STD3 check, we become vulnerable to many other attacks based on Unicode normalization. I believe this must have been carefully discussed somewhere within WHATWG. If not, I wonder how I should comment on the current URL standard and propose new changes. I am familiar with GitHub operations but I have never interacted with WHATWG before, and would appreciate your guidance in making a proposal.

TL;DR: I wish to add a warning about the danger of disabling the STD3 check, stating that an application that chooses to disable it (possibly to allow the underscore) should verify that forbidden host code points can never arise from normalization and mapping.
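
The kind of verification proposed here could look roughly like the sketch below. It is an assumption-laden illustration, not the check from UTS #46 or the URL Standard: plain NFKC stands in for the full UTS #46 mapping, `DELIMITERS` is only an illustrative subset of the forbidden host code points, and `delimiters_introduced_by_nfkc` is a hypothetical helper name.

```python
import unicodedata

# Illustrative subset only; the URL Standard's list of forbidden host
# code points is longer than this.
DELIMITERS = frozenset("@#/?:\\")


def delimiters_introduced_by_nfkc(text: str) -> set[str]:
    """Return delimiters that appear after NFKC but were absent before."""
    normalized = unicodedata.normalize("NFKC", text)
    return {ch for ch in DELIMITERS if ch in normalized and ch not in text}


print(delimiters_introduced_by_nfkc("example.com"))
print(delimiters_introduced_by_nfkc("google.com\uff03"))
```

An application that disables STD3 could reject any input for which this set is non-empty, at the cost of also rejecting some harmless strings.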

20:12
<Domenic>
favonia: I don't think I understand the issue. Because those are forbidden host code points parsing will fail if they appear. It doesn't matter for the spec whether the parsing fails due to Unicode STD3 checking in step 5 or because of the explicit step 7 in https://url.spec.whatwg.org/#concept-host-parser
21:06
<favonia>

As far as I understand, Step 7 does not prevent such attacks. One attack is that the same URL (record) has multiple Unicode normal forms that would be parsed differently. Here is an example:

https://google.com\uFF03@evil.com

A parser would give the following results:

But with its NFKC (where \uFF03 is normalized to #):

https://google.com#@evil.com

a parser would give these results instead:

So, the string and one of its normal forms are both valid URL records, but with different structures. The discrepancies can be exploited in many IDNA-aware applications to fool users or even bypass security checking. This attack is known as HostSplit.
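
The normalization step itself is easy to reproduce. The sketch below only shows that NFKC turns U+FF03 into an ASCII `#`, so a delimiter exists in one form of the string and not in the other; it makes no claim about how any particular URL parser splits either string.

```python
import unicodedata

original = "https://google.com\uff03@evil.com"
normalized = unicodedata.normalize("NFKC", original)

# The fullwidth number sign is replaced by an ASCII '#'.
print(normalized)  # https://google.com#@evil.com

# Only the normalized form contains the fragment delimiter.
print("#" in original, "#" in normalized)  # False True
```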

21:13
<Domenic>
That is not how those URLs are parsed
21:13
<Domenic>
See https://jsdom.github.io/whatwg-url/#url=aHR0cHM6Ly9nb29nbGUuY29tI0BldmlsLmNvbQ==&base=YWJvdXQ6Ymxhbms=
21:14
<Domenic>
and https://jsdom.github.io/whatwg-url/#url=aHR0cHM6Ly9nb29nbGUuY29t77yDQGV2aWwuY29t&base=YWJvdXQ6Ymxhbms=
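The two viewer links above differ only in the input they encode. Assuming the fragment carries base64-encoded UTF-8 in its `url` and `base` parameters (which is what these links appear to do; this is inferred from the links, not from the viewer's documentation), the inputs can be recovered as follows. `decode_viewer_link` is a hypothetical helper.

```python
import base64
from urllib.parse import parse_qs

LINKS = [
    "https://jsdom.github.io/whatwg-url/#url=aHR0cHM6Ly9nb29nbGUuY29tI0BldmlsLmNvbQ==&base=YWJvdXQ6Ymxhbms=",
    "https://jsdom.github.io/whatwg-url/#url=aHR0cHM6Ly9nb29nbGUuY29t77yDQGV2aWwuY29t&base=YWJvdXQ6Ymxhbms=",
]


def decode_viewer_link(link: str) -> dict[str, str]:
    # Everything after the first '#' is treated as a query-style string.
    fragment = link.split("#", 1)[1]
    params = parse_qs(fragment)
    return {
        name: base64.b64decode(values[0]).decode("utf-8")
        for name, values in params.items()
    }


decoded = [decode_viewer_link(link) for link in LINKS]
for entry in decoded:
    # ascii() makes the non-ASCII code point visible as an escape.
    print(ascii(entry["url"]), entry["base"])
```

The first link encodes the string with an ASCII `#`, the second the string with U+FF03.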
21:16
<Domenic>
I guess your larger point remains though
21:16
<Domenic>
Which is yes, different input strings can produce different hosts
21:16
<Domenic>
Opening an issue to discuss that seems fine if you want?
21:17
<favonia>
Oops, sorry for my mistakes. It's hard to construct such contrived examples on the fly. 😛 Maybe https://localhost\uFF03@evil.com and https://localhost#@evil.com would work.
21:47
<favonia>
After some thinking I realized this is probably more serious than I thought: the STD3 rules in UTS#46 could prevent some attacks but seem powerless to handle https://localhost\uFF03@evil.com vs. https://localhost#@evil.com, where the problematic character is never part of the host/domain name. 😱
21:50
<favonia>
I can start a GitHub issue, though I probably will not be able to meaningfully participate in the discussions after two weeks. (I'm teaching at a university and the semester is starting...) Also, I am not an expert on Unicode/URL, merely a concerned user after reading these documents. There might be many strange corner cases that I am not aware of. Therefore, perhaps someone else should take the lead? I can still take the initiative.
21:58
<favonia>
PS: consistently applying STD3 rules can probably detect something like https://cool.asi℀.evil.com where ℀ could be normalized to a/c.
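The expansion of ℀ can be checked directly. This small sketch scans a host string for code points that NFKC changes and prints what they expand to; it uses plain NFKC rather than the UTS #46 mapping table, and it says nothing about whether a URL parser accepts the string.

```python
import unicodedata

host = "cool.asi\u2100.evil.com"

# Report every code point that NFKC rewrites, with its expansion.
for ch in host:
    expansion = unicodedata.normalize("NFKC", ch)
    if expansion != ch:
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {expansion!r}")

normalized_host = unicodedata.normalize("NFKC", host)
print(normalized_host)
```

A single code point becomes three characters, one of which is a path delimiter.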
22:01
<Domenic>
https://cool.asi℀.evil.com is just an invalid URL https://jsdom.github.io/whatwg-url/#url=aHR0cHM6Ly9jb29sLmFzaeKEgC5ldmlsLmNvbQ==&base=YWJvdXQ6Ymxhbms=