04:53
<sirisian>
When writing a lexer for ECMAScript how do you decide when to change between the goal symbols? https://tc39.es/ecma262/#sec-ecmascript-language-lexical-grammar I naively converted them to regex to toy with an idea. https://gist.github.com/sirisian/5c3402ca51a2440f0bc4e5d297269195 (Ignore any mistakes, I plan to redo it). Like I get that you'd start with InputElementHashbangOrRegExp https://regex101.com/r/YYgu1i/1 So the lexer would take tokens until it ran into a TemplateMiddle or TemplateTail. So in that example it takes the "a" then can't consume the "}". Where does one get the context, whether a RegularExpressionLiteral or TemplateMiddle/Tail is permitted? Is this based on the previous tokens? Do you have to like parse as you run the lexer so you'd potentially parse TemplateSpans -> TemplateMiddleList -> TemplateMiddle and this that would mean that's permitted. (And then you'd do the same to see if RegularExpressionLiteral is permitted)?
05:49
<bakkot>
yes, you have to parse as you run the lexer
05:49
<bakkot>
or at least, this is how everyone does it afaik
05:49
<bakkot>

that is what this sentence is getting at:

There are several situations where the identification of lexical input elements is sensitive to the syntactic grammar context that is consuming the input elements.

05:50
<bakkot>
i.e., you can't know how to tokenize (the lexical grammar) without knowing the context from the higher-level parse (the syntactic grammar)
06:30
<Richard Gibson>
my understanding is basically that you start with lexical goal symbol |InputElementHashbangOrRegExp| and syntactic goal symbol being either |Script| or |Module|. Production of an input element from application of that |InputElementHashbangOrRegExp| goal will then limit possibilities in the syntactic grammar to the point where the new lexical goal symbol is determined. For example: if the first input element is a |TemplateHead| `prefix${ then the syntactic grammar has committed to an |ExpressionStatement| and its contained |Expression| starts with a |SubstitutionTemplate| whose aforementioned |TemplateHead| must be followed by an |Expression|. |Expression| can expand to |RegularExpressionLiteral| but not to |TemplateMiddle|, so the new lexical goal symbol is |InputElementRegExp|. If that produces input element |StringLiteral| "foo", then the syntactic grammar has committed the inner |Expression| to a |MemberExpression| starting with that literal as the |PrimaryExpression|, which can be followed by something that extends the |MemberExpression| (i.e., [ or . for member access or ` for a tagged template or a noncommittal |WhiteSpace| or |LineTerminator| or |Comment|), or otherwise by something that extends a containing production (e.g., ( for a call or ?. for an optional chain or / for a division or } to continue the outer template). So that means the next input element can be a |TemplateMiddle| or |TemplateTail| but not a |RegularExpressionLiteral|, and the new lexical goal symbol is |InputElementTemplateTail|. Continue ad nauseam.
06:48
<Richard Gibson>
Timo Tijhof: you can get consistent sorting like newPages.sort( ( a, b ) => (isNaN(a.index) ? Infinity : a.index) - (isNaN(b.index) ? Infinity : b.index) ), but I don't think there's any way to avoid some kind of surrogate value
23:55
<jmdyck>
Richard Gibson re your lexing+parsing description: yup, that sounds about right.