04:12
<Kaiido>
jmdyck: I believe it is correct, albeit not clear: the Text nodes being excluded are the descendants of <script> or svg:script elements that are themselves descendants of the <legend>. So they're indeed "descendants of descendants".
08:06
<sideshowbarker>
Fil Pizlo finally putting his Fil-C work to good use: https://yoshell.ai 😀
11:22
<foolip>
zcorpan: I'm trying to understand how <script>console.log("</script>")</script> manages to "back up" when it sees the ">" inside the console.log string so that the script element text is just 'console.log("'. Do you know what tokenizer state this is? It looks to me like "Script data double escape start state" or "Script data double escape end state", but it looks like 'console.log("</script' has already been emitted as character tokens at that point. How does the tokenizer "take back" what's already been emitted?
11:32
<zcorpan>
foolip: it doesn't back up, it just closes the script element there.
https://html.spec.whatwg.org/#script-data-state
https://html.spec.whatwg.org/#script-data-less-than-sign-state
https://html.spec.whatwg.org/#script-data-end-tag-open-state
https://html.spec.whatwg.org/#script-data-end-tag-name-state
11:33
<zcorpan>
foolip: you need <script><!-- to enter https://html.spec.whatwg.org/#script-data-escaped-state
11:35
<zcorpan>
foolip: and <script><!--<script> to enter https://html.spec.whatwg.org/#script-data-double-escaped-state (which causes a single </script> to not close the element)
11:38
<zcorpan>
https://html.spec.whatwg.org/#script-data-double-escape-end-state handling > is where </script> is sometimes "ignored"
11:38
<foolip>
Thanks, the <script>console.log("</script>")</script> case is clear now. It does use the temp buffer and if that > inside the string isn't there, it emits a character token for each char in the temp buffer.
11:39
<foolip>
That explains how the parallel lowercase + original case is handled in that case.
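The behavior discussed above can be sketched as a small Python function. This is a simplified illustration, not the spec algorithm: it only follows the script data, less-than sign, end tag open, and end tag name states, it ignores the `<!--` escaped states entirely, and the function name is made up for this example. It shows that nothing is "taken back": characters after `</` sit in the temp buffer (original case) alongside a lowercased tag name, and they are only emitted as text if the name turns out not to be an appropriate end tag.

```python
def script_element_text(after_start_tag: str) -> str:
    """Return the script text for the input that follows `<script>`.

    Simplified sketch: the escaped / double-escaped states are not modeled.
    """
    s = after_start_tag
    out = []
    i, n = 0, len(s)
    while i < n:
        ch = s[i]
        if ch != "<":
            out.append(ch)          # script data state: emit as-is
            i += 1
            continue
        # Script data less-than sign state.
        if i + 1 >= n or s[i + 1] != "/":
            out.append("<")         # emit "<", reconsume the next char
            i += 1
            continue
        # Script data end tag open state.
        j = i + 2
        if j >= n or not (s[j].isascii() and s[j].isalpha()):
            out.append("</")        # emit "</", reconsume the next char
            i = j
            continue
        # Script data end tag name state: keep the original-case characters
        # in the temp buffer and a lowercased copy in the tag name.
        temp_buffer, tag_name = [], []
        while j < n and s[j].isascii() and s[j].isalpha():
            temp_buffer.append(s[j])
            tag_name.append(s[j].lower())
            j += 1
        if "".join(tag_name) == "script" and j < n and s[j] in "\t\n\f />":
            return "".join(out)     # appropriate end tag: the text ends here
        # Not an end tag after all: emit "</" plus the temp buffer as text.
        out.append("</")
        out.extend(temp_buffer)
        i = j
    return "".join(out)


print(script_element_text('console.log("</script>")</script>'))
```

For the example from the conversation, the text stops right before the first `</script>`, so the script element contains only `console.log("`.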
11:40
<foolip>
The context is I did https://whatpr.org/html/12118/parsing.html#processing-instruction-target-state yesterday and although it's readable, I think it's an unusual use of the temp buffer that maybe I should undo.
11:40
<snek>
is there a way to have a "back" button/link without js?
11:40
<foolip>
The motivation was to avoid saying "discard the current token" which the tokenizer never does elsewhere (including in Blink's implementation).
11:41
<snek>
i was hoping there would be some obscure href target or button command or something, but i couldn't find anything
11:41
<foolip>
The problem is there might be unbounded buffering before we know whether we're going to create a PI or a bogus comment, and that we need the data lowercase for the first but the original for the latter...
11:44
<zcorpan>
Hmm right. Maybe the target should preserve case?
11:44
<foolip>
Huh, I didn't even think about that.
11:45
<foolip>
What I did consider was letting tree construction lowercase the target.
11:46
<foolip>
But preserving case is very un-HTML-y isn't it?
11:48
<zcorpan>
foolip: though https://html.spec.whatwg.org/#script-data-escaped-end-tag-name-state creates an end tag with arbitrary length and then throws it away
11:49
<zcorpan>
Except implementations could have a state to check what's up after temporary buffer > "script".length
11:50
<foolip>
Hmm, but it never says "discard the token" or something like that, it just emits another token.
11:50
<zcorpan>
foolip: anything else
11:51
<zcorpan>
or > when temporary buffer != "script"
11:52
<foolip>
Right, but seemingly not explicitly saying that the current tag token is no longer current, right?
11:52
<foolip>
Maybe something else is guaranteed to say "create a bla token" before anything can get confused.
11:53
<zcorpan>
Right, the end tag token is created but then nothing happens with it
11:54
<foolip>
I see that in Blink, for this case, there's a special buffered_end_tag_name_, so the token is only created once we know what it's going to be.
11:57
<zcorpan>
How much is ok to buffer? For compat with <?lit$ it's just 4 chars
11:58
<foolip>
I checked and it looks like Blink's buffer can grow unbounded. (I expected to find a hardcoded max size that is impossible to exceed because of how it's used, but no.)
11:58
<foolip>
Do you think there are important rules to follow here? Possible principles I've inferred include (1) only put things in the temp buffer to compare them, not to read them back (2) don't create a token unless it will be emitted (3) normalize casing in tokenization, not tree construction and (4) no unbounded buffering.
11:59
<foolip>
I don't think all of these can be hard rules at the same time.
11:59
<zcorpan>
The temporary buffer is sometimes emitted as characters
12:01
<foolip>
Yes, you're right, so taking it and sticking it into comment data feels OK, and that's what I did.
12:01
<foolip>
Just that the wording is slightly different because character tokens are emitted one by one.
12:02
<foolip>
We don't actually need unbounded buffering and lookahead, at most "xml-stylesheet".length + 1 is strictly required.
12:03
<foolip>
No wait, that's wrong. After the recent change there's also <?superduperlong$not-a-pi?>.
12:04
<zcorpan>
Right. But we could place some limit on how many characters to check
12:05
<foolip>
And after that if you see a $ we make it part of the target instead?
12:05
<zcorpan>
Yeah
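The bounded check floated here might look roughly like the Python sketch below. Everything concrete in it is an assumption made for illustration: the limit of 16, the set of characters that end a target, and the function name are not taken from the PR or the spec.

```python
# Hypothetical limit; the conversation does not settle on a number.
MAX_CHECKED = 16
# Assumed set of characters that end a target.
TERMINATORS = "\t\n\f ?>"


def dollar_forces_comment(after_marker: str, limit: int = MAX_CHECKED) -> bool:
    """Return True if a '$' within the first `limit` target characters
    would turn the `<?...` construct into a bogus comment."""
    for index, ch in enumerate(after_marker):
        if index >= limit or ch in TERMINATORS:
            # Past the limit (or the target ended): stop checking, so any
            # later '$' is simply treated as part of the target or data.
            return False
        if ch == "$":
            return True
    return False
```

With a fixed limit, the tokenizer never needs to hold more than `limit` characters before deciding, at the cost of the rules changing once that many characters have been consumed.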
12:05
<foolip>
The test cases are going to be fun at least :)
12:08
<foolip>
zcorpan: lit$ doesn't show up in httparchive, any idea how we can find out if there are more cases like it?
12:08
<foolip>
use counters?
12:09
<zcorpan>
foolip: My query didn't include PIs with $
12:09
<foolip>
Oh right, these are just the ones that would be valid PIs per our new stricter target names.
12:10
<zcorpan>
Yes. Or https://docs.google.com/spreadsheets/d/1o04eP_BwH1u7X8CyyLUvxOsntZNmfrDalfhU7ldVqlU/edit?usp=sharing did (just required the first char to be a-z), but was only 1% of the data
12:10
<foolip>
Noam Rosenthal: you also have a spreadsheet right?
12:12
<Noam Rosenthal>
This one looks more comprehensive but similar
12:16
<zcorpan>
foolip: checking https://github.com/search?q=%2F%3C%5C%3F%5Ba-zA-Z%5D%2F+language%3Ahtml+NOT+%22%3C%3Fphp%22+NOT+%22%3C%3Fxml%22&type=code it looks like sometimes end tags are typoed as e.g. <?a>
12:19
<foolip>
So the concrete options I see are (1) preserve case in PI target (2) buffer both original + lowercased data until we know if it will be a PI or comment (3) buffer only original and then lowercase it when creating the PI token (4) initially create a comment token and replace it with a PI token when we have a valid target.
12:19
<foolip>
I think 2-4 are equivalent, mostly a matter of what fits best with existing patterns.
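As a rough illustration of option (3), the Python sketch below buffers only the original characters and lowercases them at the moment a PI token would be created. The target character rule, the terminator set, and all names are hypothetical stand-ins, not the rules from the PR.

```python
from typing import Tuple

# Assumed set of characters that end a target.
TERMINATORS = "\t\n\f ?>"


def is_target_char(ch: str) -> bool:
    # Hypothetical rule: ASCII letters, digits and "-" may appear in a target.
    return ch.isascii() and (ch.isalnum() or ch == "-")


def classify(after_marker: str) -> Tuple[str, str]:
    """Decide what the text following `<?` starts.

    Returns ("pi", lowercased_target) or ("comment", original_case_prefix).
    """
    buffered = []  # original case only; no parallel lowercase copy
    for ch in after_marker:
        if is_target_char(ch):
            buffered.append(ch)
        elif buffered and ch in TERMINATORS:
            # Valid target: lowercase only now, when creating the PI token.
            return ("pi", "".join(buffered).lower())
        else:
            break
    # Invalid character such as "$", or end of input (an assumption of this
    # sketch): the buffered text goes into comment data in its original case.
    return ("comment", "".join(buffered))


print(classify("XML-Stylesheet href"))
print(classify("Lit$1234"))
```

Option (2) would differ only in keeping a second, lowercased buffer alongside `buffered` instead of calling `lower()` at the end, which is why the observable result is the same.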
12:21
<zcorpan>
hsivonen: do you have an opinion? ^
12:21
<foolip>
Oh, and (5) fixed length buffering or lookahead, so that the rules change after N characters are consumed as target.
12:56
<hsivonen>
Is the only downside of preserving case inconsistency with the rest of HTML?
13:04
<zcorpan>
hsivonen: Yeah I think so. Though <![CDATA[ is case-sensitive
13:13
<hsivonen>
Let's preserve the case. Makes everything simpler.
13:17
<Noam Rosenthal>
Does this mean <?marker> needs to be case-sensitive? Or I guess we can do a case-insensitive target lookup when selecting them, foolip?
13:25
<zcorpan>
Noam Rosenthal: Case-sensitive lookup imo
13:30
<Noam Rosenthal>
Works for me. Ok if the underlying target lookup is case sensitive 👍
14:31
<foolip>
Yeah, consistency is the reason.
14:32
<foolip>
If we make the target preserve case, how about attribute names?
21:20
<jmdyck>
Ah, okay, so "that are themselves script or SVG script elements" has to modify the second "descendants". Yeah, that makes sense.
23:31
<cwilso>
Hey gang - I just did the minutes-and-agenda-prep shuffle to prepare for next week's WHATNOT meeting (https://github.com/whatwg/html/issues/12141, AMER+EMEA timeslot). Unfortunately I will be on a plane during the meeting, so I will need someone else to step in to chair. Any volunteers?