TC39 General on 2024-05-31

04:53	<sirisian>	When writing a lexer for ECMAScript, how do you decide when to switch between the lexical goal symbols? https://tc39.es/ecma262/#sec-ecmascript-language-lexical-grammar I naively converted them to regexes to toy with an idea: https://gist.github.com/sirisian/5c3402ca51a2440f0bc4e5d297269195 (ignore any mistakes, I plan to redo it). I get that you'd start with InputElementHashbangOrRegExp: https://regex101.com/r/YYgu1i/1 So the lexer would take tokens until it ran into a TemplateMiddle or TemplateTail. In that example it takes the "a" and then can't consume the "}". Where does one get the context for whether a RegularExpressionLiteral or a TemplateMiddle/TemplateTail is permitted? Is it based on the previous tokens? Do you have to parse as you run the lexer, so you'd potentially parse TemplateSpans -> TemplateMiddleList -> TemplateMiddle, and that would mean TemplateMiddle is permitted (and then you'd do the same to see whether RegularExpressionLiteral is permitted)?
05:49	<bakkot>	yes, you have to parse as you run the lexer
05:49	<bakkot>	or at least, this is how everyone does it afaik
05:49	<bakkot>	that is what this sentence is getting at: "There are several situations where the identification of lexical input elements is sensitive to the syntactic grammar context that is consuming the input elements."
05:50	<bakkot>	i.e., you can't know how to tokenize (the lexical grammar) without knowing the context from the higher-level parse (the syntactic grammar)
06:30	<Richard Gibson>	my understanding is basically that you start with lexical goal symbol |InputElementHashbangOrRegExp| and syntactic goal symbol being either |Script| or |Module|. Production of an input element from application of that |InputElementHashbangOrRegExp| goal will then limit possibilities in the syntactic grammar to the point where the new lexical goal symbol is determined. For example: if the first input element is a |TemplateHead| `prefix${ then the syntactic grammar has committed to an |ExpressionStatement| and its contained |Expression| starts with a |SubstitutionTemplate| whose aforementioned |TemplateHead| must be followed by an |Expression|. |Expression| can expand to |RegularExpressionLiteral| but not to |TemplateMiddle|, so the new lexical goal symbol is |InputElementRegExp|. If that produces input element |StringLiteral| `"foo"`, then the syntactic grammar has committed the inner |Expression| to a |MemberExpression| starting with that literal as the |PrimaryExpression|, which can be followed by something that extends the |MemberExpression| (i.e., `[` or `.` for member access or a backtick for a tagged template or a noncommittal |WhiteSpace| or |LineTerminator| or |Comment|), or otherwise by something that extends a containing production (e.g., `(` for a call or `?.` for an optional chain or `/` for a division or `}` to continue the outer template). So that means the next input element can be a |TemplateMiddle| or |TemplateTail| but not a |RegularExpressionLiteral|, and the new lexical goal symbol is |InputElementTemplateTail|. Continue ad nauseam.
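The goal-switching described above can be sketched roughly as follows. This is a hypothetical, heavily simplified illustration, not the spec's algorithm and not how any particular engine does it: the function name `nextToken`, the token shapes, and the toy regexes are all assumptions made for the example. Only the goal names are taken from the spec's lexical grammar. The point is just that the same characters tokenize differently depending on the goal the parser passes in.

```javascript
// Hypothetical sketch: a tokenizer step whose result depends on the lexical goal
// supplied by the parser. The regexes are toy approximations, not the real grammar.
const REGEXP_GOALS = new Set([
  "InputElementHashbangOrRegExp",
  "InputElementRegExp",
  "InputElementRegExpOrTemplateTail",
]);
const TEMPLATE_TAIL_GOALS = new Set([
  "InputElementTemplateTail",
  "InputElementRegExpOrTemplateTail",
]);

function nextToken(src, pos, goal) {
  const rest = src.slice(pos);
  let m;
  // Input elements that do not depend on the goal (greatly simplified).
  if ((m = /^\s+/.exec(rest))) return { type: "WhiteSpace", text: m[0] };
  if ((m = /^[A-Za-z_$][\w$]*/.exec(rest))) return { type: "IdentifierName", text: m[0] };
  if ((m = /^\d+/.exec(rest))) return { type: "NumericLiteral", text: m[0] };
  if ((m = /^"[^"\n]*"/.exec(rest))) return { type: "StringLiteral", text: m[0] };
  if ((m = /^`[^`$]*\$\{/.exec(rest))) return { type: "TemplateHead", text: m[0] };

  // "/" starts a regular expression only if the current goal permits one;
  // otherwise it is a division punctuator.
  if (rest[0] === "/") {
    if (REGEXP_GOALS.has(goal) && (m = /^\/[^\/\n]+\/[a-z]*/.exec(rest))) {
      return { type: "RegularExpressionLiteral", text: m[0] };
    }
    return { type: "Punctuator", text: "/" };
  }

  // "}" continues a template only if the current goal permits a
  // TemplateMiddle or TemplateTail; otherwise it is a plain closing brace.
  if (rest[0] === "}") {
    if (TEMPLATE_TAIL_GOALS.has(goal)) {
      if ((m = /^\}[^`$]*\$\{/.exec(rest))) return { type: "TemplateMiddle", text: m[0] };
      if ((m = /^\}[^`$]*`/.exec(rest))) return { type: "TemplateTail", text: m[0] };
    }
    return { type: "RightBracePunctuator", text: "}" };
  }

  return { type: "Punctuator", text: rest[0] };
}

// Same characters, different goals chosen by the (imagined) parser:
const src = '`prefix${"foo"}tail`';
const brace = src.indexOf("}");
console.log(nextToken(src, brace, "InputElementTemplateTail").type); // TemplateTail
console.log(nextToken(src, brace, "InputElementDiv").type);          // RightBracePunctuator
console.log(nextToken("a /b/g", 2, "InputElementRegExp").type);      // RegularExpressionLiteral
console.log(nextToken("a /b/g", 2, "InputElementDiv").type);         // Punctuator
```

In a real implementation the parser would pick `goal` before each call based on what the syntactic grammar can accept next, which is the step the discussion above is about; this sketch leaves that part out.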
06:48	<Richard Gibson>	Timo Tijhof: you can get consistent sorting like `newPages.sort( ( a, b ) => (isNaN(a.index) ? Infinity : a.index) - (isNaN(b.index) ? Infinity : b.index) )`, but I don't think there's any way to avoid some kind of surrogate value
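A small runnable version of that comparator, for illustration only. The sample data is made up; only the `index` property and the `Infinity` surrogate come from the message above. It sorts a copy so the input array is left untouched.

```javascript
// Hypothetical sample data: entries whose index is NaN should sort last.
const newPages = [
  { title: "c", index: 3 },
  { title: "x", index: NaN },
  { title: "a", index: 1 },
  { title: "y", index: NaN },
  { title: "b", index: 2 },
];

// Map NaN to a surrogate value (Infinity) so every comparison is well defined.
const sortKey = (page) => (isNaN(page.index) ? Infinity : page.index);

// When both keys are Infinity the subtraction yields NaN, which Array.prototype.sort
// treats as 0 (equal), so those entries keep their relative order.
const sortedPages = [...newPages].sort((a, b) => sortKey(a) - sortKey(b));

console.log(sortedPages.map((page) => page.title).join(",")); // a,b,c,x,y
```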
23:55	<jmdyck>	Richard Gibson re your lexing+parsing description: yup, that sounds about right.