| 10:44 | <Jake Archibald> | There's still some discussion over how it should behave with script & preloading https://github.com/whatwg/html/issues/11669#issuecomment-3550088089. Do you think this blocks moving to stage 2? If not I'm fine with it and we can continue to discuss. zcorpan does this make sense with the stages model? |
| 11:10 | <annevk> | foolip: why doesn't it add runScripts to setHTMLUnsafe? |
| 11:13 | <foolip> | annevk: That's just an oversight. Do you think we should add runScripts to setHTMLUnsafe() independently (before) streamHTMLUnsafe(), or do it all in one PR? |
| 11:14 | <annevk> | I don't have a preference, except that setHTMLUnsafe() getting it later doesn't seem good. |
| 11:21 | <foolip> | Yeah, that wouldn't make sense |
| 11:21 | <annevk> | I left some other comments on the PR. Forgot to comment on runScripts. 😅 |
| 11:21 | <foolip> | I'll add it to the PR so that it's somewhere for now. |
| 11:27 | <Noam Rosenthal> | I'm not sure there is actually much of an issue there; we can tee those streams to use the speculative parser the same way it's done in the main parser. The mechanism of that is kind of loosely spec'ed to begin with though |
| 11:28 | <Jake Archibald> | Noam Rosenthal: ohh, I didn't realise that the buffering to allow scripts to run before adding the next element stuff had made it into the spec |
| 11:30 | <Noam Rosenthal> | It sort of did. The stream that goes into the parser is teed to the main parser and speculative parser, but a lot of details there are left to implementations. In either case, I checked it in the spec and streaming to an element doesn't have an effective change on this (though of course this needs to be properly tested). |
| 11:41 | <Jake Archibald> |
That sounds… bad :D |
| 11:42 | <Jake Archibald> | Noam Rosenthal: I really think the stuff in https://github.com/whatwg/html/issues/11669#issuecomment-3537477657 should be interoperable |
| 11:47 | <Noam Rosenthal> | That stuff is interoperable. I meant that the speculative parsing behaviour might not be entirely interoperable. |
| 11:55 | <Jake Archibald> | oh yeah that's fine |
| 12:03 | <Jake Archibald> | Noam Rosenthal: someone's asking "why contentname and not id", and I couldn't find the relevant discussion |
| 12:03 | <Jake Archibald> | https://x.com/royalicing/status/1991714869012582584 |
| 12:06 | <foolip> | Thanks for the feedback! I've addressed the simple stuff now. For the HTML fragment parsing algorithm refactor I'm concerned that the diff will be unreadable if I just do it. In the end there should be one algorithm, so maybe as a middle ground I could add "(streaming case)" markers to some steps to show exactly where the changes are instead of waving at them from the outside? |
| 12:08 | <annevk> | I think you need to modify the core HTML parser as well, at least for some of the things we discussed. And ideally some of that is tidied up as it's all rather hand-wavy currently and adding this on top will make it even more fragile. So yeah, maybe do some refactoring first. |
| 12:10 | <Noam Rosenthal> | It's an XSS mitigation, to avoid new streaming modifying markup that doesn't expect it. It was raised in the first WHATNOT about this and also by our security folks. It was discussed here https://docs.google.com/document/d/1iaarr4Ho715CUULrvi_LD3TwshAcN2odDLBBEK0FjH0/edit?tab=t.0#heading=h.oh1h778cnny but the minutes are lacking/missing |
| 12:12 | <Noam Rosenthal> | ... also, contentname is not meant to be document-unique like ID. it is resolved from the template's parent downwards (or for the whole document if the template is a direct child of the body) |
| 13:53 | <sb3nder> | bkardell: Do you have any insight? |
| 14:45 | <foolip> | For Sanitizer definitely, although I think that has to be a different PR. But also for runScripts, at least https://html.spec.whatwg.org/#parsing-main-inhead:already-started should be modified. |
| 14:47 | <foolip> | annevk: but this really seems more like stage 3 ("Complete specification text") than stage 2 ("draft specification") to me, don't you think so? |
| 15:19 | <annevk> | foolip: are you sure that's only for sanitizer? For instance, the way it is currently specified declarative shadow trees would not work, because there's no separate parser document and thus you would get the settings from the main document parser. I don't think that's what we want. |
| 15:23 | <annevk> | foolip: anyway, API-shape-wise this is all seems fairly settled and it does seem somewhat unlikely we'll have to deviate very far from this as we define things better. But it's a big gap and I really hope this will come with some improvements to how we define the HTML parser in general, because I don't know if you've noticed, but people keep reporting new issues with it. |
| 15:29 | <foolip> | @annevk:matrix.org: My current thinking is to define an algorithm that works for streamHTMLUnsafe() and then wrap that in a synchronous form for setHTUnsafe() and other existing APIs. I'm not sure how exactly though because streams primitives aren't synchronous so it will have to be something else. Maybe just a parser object that can be fed chunks. |
| 15:31 | <foolip> | That isn't the most interesting stuff however. What makes scripts and DSD work differently in different parser modes is much more interesting. For scripts I'm pretty sure of how to do it, for DSD I'll need to look at where we get the setting from in the absence of a separate parser document. |
| 15:36 | <annevk> | I'm not sure it's just scripts and declarative shadow trees. I can imagine there might also be differences between node creation and insertion that would show up. Because as currently proposed the node would not be created in an "inert" document. I suspect that needs to change so it's the same as setHTMLUnsafe() and friends. |
| 17:10 | <foolip> | This aspect would be more like the main parser where nodes are created in and inserted into the main document. There might definitely be surprises hiding here since it's not quite like the main parser and not quite the fragment parser, but I'm not sure that bouncing nodes off an inert document is really better. Noam Rosenthal can you think of any observable differences between those two models? |
| 17:19 | <annevk> | foolip: well something like that is clearly needed for the Sanitizer and we don't want the design for streamHTMLUnsafe() to be fundamentally different from streamHTML(). |
| 17:19 | <annevk> | foolip: so solving for the Sanitizer is definitely going to be needed to some extent for v1. Much more so than with setHTMLUnsafe() as there it was very clear what needed to be done. |
| 17:22 | <Noam Rosenthal> | In blink it's basically the fragment parser, but when it's time to insert to the the root created in step 7, we insert directly to the context node.There is no inert document in the blink implementation in the first place - the existing setHTMLUnsafe and friends it's basically a DocumentFragment and those elements are inserted to that fragment directly. |
| 17:32 | <annevk> | Sure, I suspect all implementations optimize that away, but the observables are the same if you go through a DocumentFragment. Not so much if you skip that. |
| 17:45 | <Noam Rosenthal> | ... but if the difference was between node creation and insertion times, the DocumentFragment in the middle is not what would matter, but rather the inert document... e.g. if custom elements were created at some point and then asynchronously inserted much later it would be observable with the custom element constructor or some such. The differences we've found so far are the ones you'd expect due to streaming (e.g. classic scripts run immediately) |
| 17:47 | <Noam Rosenthal> | ... at least nothing obvious came up from using the fragment parser in this way and just adjusting the insertion point when inserting into the root. It doesn't mean that there aren't surprises somewhere, the parser is pretty big. |
| 17:50 | <annevk> | That's a fair point. I wonder how that works for img for instance. That would mean it starts loading before the Sanitizer API gets to it. |
| 17:54 | <Noam Rosenthal> | I don't think there's an observable time difference between creating an <img> and inserting it though |
| 18:02 | <Noam Rosenthal> | Ah I guess sanitizer changes that. But IIRC this kind of loading side-effect is done in a separate task |
| 18:09 | <Noam Rosenthal> | Anyway, if this is a problem, I think we can create the elements in the inert doc (exactly like the fragment parser) and adjust the insertion point - which would adopt them on insertion. This would be very close to today's detached document streaming technique and won't have this side effect (which exists today in implementation, in a way that would only be observable when we introduce the sanitizer). |