00:00
<devsnek>
maybe the `of` in a for/of statement
00:00
<toddobryan>
Even though Punctuator is defined in the lexical grammar, it's not used in the syntactic grammar. Instead you have things like RelationalOperator or EqualityOperator.
00:00
<devsnek>
ah you mean when it does bold text
00:01
<devsnek>
generally those occurrences are described by the lexical grammar, even when it doesn't specifically mention the lexical production in question
00:01
<toddobryan>
Exactly.
00:02
<toddobryan>
Really, they're not. Like I said, the lexical grammar includes Punctuator. Nowhere is that referred to in the syntactic grammar.
00:03
<devsnek>
toddobryan: if it helps, think of them like c enums
00:04
<devsnek>
everything used in the syntactic grammar should be well defined in the lexical grammar, even if it's not directly referring to it
00:05
<toddobryan>
And I lied above. RelationalOperator and EqualityOperator don't exist. I created them when I was writing my grammar. In the lexical grammar, they're not distinguishable (all under Punctuator) and in the syntactic grammar, they're literals: https://www.ecma-international.org/ecma-262/#prod-RelationalExpression and https://www.ecma-international.org/ecma-262/#prod-EqualityExpression.
00:08
<devsnek>
toddobryan: I'm not sure I understand what the problem is 😅
00:08
<toddobryan>
devsnek: I'd be fine with that, but nowhere in the spec does it specify how/when you should choose between the goal symbols of InputElementDiv, InputElementRegExp, etc.
00:09
<devsnek>
it shouldn't ever have to specify how you choose
00:09
<devsnek>
there are maybe two places where it does (if statements and some regex thing in annex b)
00:09
<devsnek>
there's a pr open to fix the if statement one
00:09
<Bakkot>
It does specify: "The InputElementRegExpOrTemplateTail goal is used in syntactic grammar contexts where a RegularExpressionLiteral, a TemplateMiddle, or a TemplateTail is permitted. The InputElementRegExp goal symbol is used in all syntactic grammar contexts where a RegularExpressionLiteral is permitted but neither a TemplateMiddle, nor a TemplateTail is permitted. The InputElementTemplateTail goal is used in all syntactic grammar contexts where a TemplateMiddle or a TemplateTail is permitted but a RegularExpressionLiteral is not permitted. In all other contexts, InputElementDiv is used as the lexical goal symbol."
00:09
<toddobryan>
OK--here's the basic problem. If the left-hand side of a grammar production never appears in the right-hand side of another rule, that production is unused and provides no information to the spec.
00:10
<toddobryan>
Find me a place where InputElementDiv is used in the syntactic grammar.
00:12
<Bakkot>
The information it provides is in how to divide up the source text.
00:13
<devsnek>
I think it's described in section 11?
00:14
<toddobryan>
But it's not actually used. In the syntactic grammar, the input is referred to explicitly, not the productions that the lexical grammar provides.
00:16
<toddobryan>
And since the syntactic grammar does not explicitly use those rules, it's almost impossible to write a tokenizer. (Or at least it has been for me.)
00:17
<Bakkot>
The syntactic grammar is not defined over the source text. If it were it would have to deal with whitespace and comments.
00:18
<Bakkot>
The point of splitting them is so that the lexical grammar can deal with whitespace and comments, including stuff like "is `/ a /` three tokens or one"
00:19
<rkirsling>
to quote the page I linked
00:19
<rkirsling>
> In implementations, the syntactic grammar analyzer (“parser”) may call the lexical grammar analyzer (“tokenizer” or “lexer”), passing the goal symbol as a parameter and asking for the next input element suitable for that goal symbol.
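A minimal Scala sketch of the interface that note describes (names are hypothetical, not from the spec): the parser picks the goal symbol for its current syntactic context and passes it to the lexer.

```scala
sealed trait LexicalGoal
case object InputElementDiv extends LexicalGoal
case object InputElementRegExp extends LexicalGoal
case object InputElementTemplateTail extends LexicalGoal
case object InputElementRegExpOrTemplateTail extends LexicalGoal

sealed trait Token // stand-in for the lexical grammar's input elements

trait Lexer {
  // The goal decides, e.g., whether `/` in the source begins a
  // RegularExpressionLiteral or is the DivPunctuator `/`.
  def nextInputElement(goal: LexicalGoal): Token
}
```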
00:20
<toddobryan>
OK. Fair enough. But how would I know when to use the `InputElementRegExp` production to get the next token?
00:21
<toddobryan>
See note 1 here: https://www.ecma-international.org/ecma-262/#prod-GeneratorMethod
00:21
<toddobryan>
There is no such note anywhere that references InputElementRegExp. (That I can find. I'd be happy to learn I'm missing something.)
00:22
<devsnek>
in terms of concrete implementation you don't usually see any specific "now we're lexing the input" type stuff
00:22
<devsnek>
aside from handling hashbang
00:23
<Bakkot>
The bit I quoted above says you'd use `InputElementRegExp` to get the next input element "in all syntactic grammar contexts where a RegularExpressionLiteral is permitted but neither a TemplateMiddle, nor a TemplateTail is permitted"
00:24
<devsnek>
all the implementations I know of use a whitelist of allowed tokens following the yield
00:24
<devsnek>
there aren't that many
00:26
<toddobryan>
So, you'd use `InputElementRegExp` when applying `PrimaryExpression`? (I did a quick search, and I think that's the only rule that `RegularExpressionLiteral` appears on the right of.)
00:27
<Bakkot>
mmm.... "when applying" doesn't exactly make sense, I think
00:27
<Bakkot>
if you've just parsed _part_ of an expression, like `a +`, then you'd use `InputElementRegExp`
00:27
<toddobryan>
When trying to satisfy?
00:28
<Bakkot>
The spec is kind of written on the assumption that you'll be using a bottom-up parser, probably specifically a shift-reduce parser
00:28
<Bakkot>
which doesn't really have a notion of "trying to satisfy"
00:29
<Bakkot>
trying to satisfy is more of a top-down thing
00:29
<toddobryan>
How can you do a bottom-up parse when what's legal as a token depends on the context you're in?
00:30
<Bakkot>
Because you don't have to know exactly which context you're in
00:30
<toddobryan>
I'm writing a recursive-descent parser, so I do know which rule I'm trying to apply...
00:30
<Bakkot>
Like I said, if you've just parsed `a +`, you know that the following token can be a RegularExpressionLiteral but not a TemplateMiddle or a TemplateTail, so you know to use InputElementRegExp
00:31
<Bakkot>
if you've just parsed `yield`, and you're in a template interpolation, you know that the next token could be a RegularExpressionLiteral or a TemplateTail, so you'd use InputElementRegExpOrTemplateTail
00:31
<Bakkot>
etc
00:31
<Bakkot>
ok, so, backing up a bit
00:32
<Bakkot>
Are you specifically interested in having your parser cleave as close as possible to the spec, or are you just trying to write a parser?
00:32
<toddobryan>
Is there a way to know, only from the previous tokens, which goal symbol I should be using?
00:33
<toddobryan>
def importCall(_yield: Boolean, _await: Boolean) = str("import") ~ elem('(') ~ assignmentExpression(true, _yield, _await) ~ elem(')')
00:33
<toddobryan>
That's a sample rule from the parser I've written so far, so I'm sticking pretty close to the spec. :-)
00:34
<devsnek>
what language is that
00:34
<toddobryan>
Scala.
00:34
<Bakkot>
parser combinators, woo
00:34
<Bakkot>
Anyway, the answer to your previous question is yes
00:35
<Bakkot>
You know which syntactic contexts you might be in, which means you know if the next token can be a regexp, a template tail, or neither, which means you know which of the goal symbols to use
00:36
<Bakkot>
both? InputElementRegExpOrTemplateTail. just regexp? InputElementRegExp. just template middle/tail? InputElementTemplateTail. neither? InputElementDiv.
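That decision table, transcribed directly as a sketch reusing the hypothetical `LexicalGoal` type from the earlier snippet; the parser has to be able to answer the two questions for its current context.

```scala
def goalFor(regExpPermitted: Boolean, templateTailPermitted: Boolean): LexicalGoal =
  (regExpPermitted, templateTailPermitted) match {
    case (true, true)   => InputElementRegExpOrTemplateTail
    case (true, false)  => InputElementRegExp
    case (false, true)  => InputElementTemplateTail
    case (false, false) => InputElementDiv
  }
```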
00:36
<devsnek>
you might want to take a look at some existing parsers (acorn, babel, shiftjs) and see how they work
00:37
<toddobryan>
OK. I know about TemplateTail, because if I don't have a previous, unclosed TemplateHead, that's out.
00:37
<Bakkot>
devsnek: ehh, so, the reason I was asking whether toddobryan was interested in sticking to the spec or not was, the spec is written for clarity and precision, not for ease of implementation
00:37
<Bakkot>
so none of the parser implementations look very much like the spec
00:38
<Bakkot>
if you're trying to look like the spec you have to use different implementation strategies
00:38
<toddobryan>
Is there something as easy as that for whether a Regexp is legal?
00:39
<toddobryan>
I'm guessing I'll need a couple of flags during tokenizing that just flip on or off `isRegexpAllowed` and `isTemplateTailAllowed`.
00:40
<devsnek>
for regex literals you generally run over them with some very light rule that basically just recognizes `/`, `[`, and `]`
00:40
<Bakkot>
mm, not quite as easy as that, I don't think
00:40
<devsnek>
and then pass it to the separate regex parser
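A rough sketch of that "very light" scan (hypothetical helper; a real lexer would also reject unescaped line terminators): it tracks just enough state — escapes and `[...]` character classes — to find the closing `/`, and leaves validating the pattern to the separate regex parser.

```scala
// Returns the index of the closing `/`, given the index just past the
// opening `/` (or src.length if the literal is unterminated).
def scanRegExpBody(src: String, start: Int): Int = {
  var i = start
  var inClass = false // inside `[...]`, an unescaped `/` is just a character
  while (i < src.length && (inClass || src(i) != '/')) {
    src(i) match {
      case '\\' => i += 1 // skip the escaped character
      case '['  => inClass = true
      case ']'  => inClass = false
      case _    =>
    }
    i += 1
  }
  i
}
```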
00:40
<Bakkot>
devsnek: that happens later, not when trying to decide which lexical goal symbol to use
00:40
<Bakkot>
it's a separate concern
00:40
<devsnek>
oh they meant if a regex is allowed
00:40
<devsnek>
not if it's valid
00:41
<Bakkot>
toddobryan: basically, though, a regexp is legal wherever an expression is legal; if the next token can't begin an expression (without an intervening semicolon), then you can't have a regexp
00:41
<Bakkot>
so, e.g., if you have just finished parsing an expression (other than `yield`), you can't have a regexp
00:42
<Bakkot>
btw if you haven't thought about ASI now is the time at which you'll need to think about ASI
00:42
<toddobryan>
ASI?
00:42
<devsnek>
also the [no LineTerminator here] restrictions
00:42
<Bakkot>
automatic semicolon insertion
00:42
<Bakkot>
https://tc39.es/ecma262/#sec-automatic-semicolon-insertion
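A heavily reduced sketch of the core ASI condition from the linked section (hypothetical token shape; the real algorithm has more cases — restricted productions like `return` followed by a line break insert eagerly, and some insertions are forbidden, e.g. ones that would create an empty statement or a semicolon inside a `for` header).

```scala
final case class Tok(text: String, newlineBefore: Boolean)

// "Offending token" = the token that can't extend the production being parsed.
def semicolonInsertable(offending: Tok, atEndOfInput: Boolean): Boolean =
  atEndOfInput || offending.newlineBefore || offending.text == "}"
```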
00:43
<toddobryan>
Yeah. That's next on the list.
00:44
<toddobryan>
OK. So I guess I need to figure out how to keep track of whether a RegularExpressionLiteral is allowed.
00:46
<Bakkot>
I think you always know at any point you'd ask the tokenizer for the next token
00:46
<Bakkot>
like the normal way of writing a recursive descent parser, you have, I don't know, parseConditionalExpression or whatever
00:46
<toddobryan>
Thanks for humoring me. I now understand what I was missing.
00:48
<Bakkot>
and you call parseBinaryExpression and so on, and eventually end up at parsePrimaryExpression, and you're looking at the next token to determine which kind of primary expression it is
00:48
<toddobryan>
Well, I was hoping to separate the parser and the tokenizer, but couldn't figure out how to do that without understanding which rules were applicable.
00:48
<Bakkot>
and at that point you know that a regexp is legal, so when you ask for the next token, you know to ask for the regexp ones
00:48
<Bakkot>
ah, yeah, you can't split them out because you don't know which goal symbol to use without knowing the syntactic context, unfortunately
00:49
<Bakkot>
specifically, you don't know if `/` (or `/=`) is going to be the beginning of a regexp or a division without knowing the syntactic context
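Concretely, in a recursive-descent parser the context is implicit in which function you're currently in, so choosing the goal needs no extra state. A self-contained sketch of that idea (all names invented for illustration; simplified to two goals):

```scala
sealed trait Token
final case class RegExpTok(body: String, flags: String) extends Token
final case class NumberTok(value: Double) extends Token
case object DivTok extends Token

sealed trait LexGoal
case object GoalDiv extends LexGoal    // `/` is division
case object GoalRegExp extends LexGoal // `/` starts a regexp

trait Lexer {
  def next(goal: LexGoal): Token
  def peek(goal: LexGoal): Token
}

sealed trait Node
final case class RegExpLit(body: String, flags: String) extends Node
final case class NumberLit(value: Double) extends Node
final case class Div(left: Node, right: Node) extends Node

class ExprParser(lexer: Lexer) {
  // Operand position: an expression may begin here, so the regexp goal applies.
  def parsePrimary(): Node = lexer.next(GoalRegExp) match {
    case RegExpTok(b, f) => RegExpLit(b, f)
    case NumberTok(v)    => NumberLit(v)
    case other           => sys.error(s"unexpected token: $other")
  }

  // Operator position: an operand was just finished, so `/` can only be
  // division and the div goal applies.
  def parseMultiplicative(): Node = {
    var left = parsePrimary()
    while (lexer.peek(GoalDiv) == DivTok) {
      lexer.next(GoalDiv)
      left = Div(left, parsePrimary())
    }
    left
  }
}
```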
00:50
<Bakkot>
anyway, good luck! if you make progress and are inclined to share you should post it; I would enjoy reading it
00:51
<Bakkot>
I maintain a recursive descent JS parser written in Java, which... is a bit of a pain
00:51
<toddobryan>
That makes perfect sense. So if I did want to write just a tokenizer, I'd need to include enough syntactic context to disambiguate.
00:53
<toddobryan>
I can imagine. I tried creating a parser combinator library in Kotlin and gave up. The thing that makes them so nice in Scala is that you have the flexible syntax, but, and this is key, you can pass arguments by name in addition to by value, so you don't have to deal with things like circular references and such.
00:53
<toddobryan>
Thanks for all the help!
00:54
<Bakkot>
Yeah, the Java one does not get to use combinators
00:54
<Bakkot>
Well, it could I guess, but it doesn't
00:55
<Bakkot>
(it lives in https://github.com/shapesecurity/shift-java/blob/es2018/src/main/java/com/shapesecurity/shift/es2018/parser/GenericParser.java )
00:55
<toddobryan>
Thanks!
01:00
<Bakkot>
toddobryan: thinking about it more, for a recursive descent parser I don't think you'd need to actually "keep track" of any state, as such, to know which of the four lexical grammar goal symbols to ask for
01:00
<Bakkot>
it's always going to be obvious every time you ask for a token
01:01
<toddobryan>
Yeah. It seems to work for parsing without knowing--what I can't do is tokenize.
01:01
<Bakkot>
Can't tokenize ahead of parsing, right
01:02
<toddobryan>
Because the syntactic grammar rules aren't written in terms of tokens--they're written in terms of input literals in lots of cases.
01:03
<Bakkot>
Hm. I guess the way I'd put it is, the input literals are specializations of the nonterminals of the lexical grammar
01:03
<Bakkot>
when the syntactic grammar says `if`, for example, that is an IdentifierName which has the contents "if"
01:04
<Bakkot>
i.e. it is a particular kind of IdentifierName
01:04
<Bakkot>
similarly when it says `(` that is a particular kind of Punctuator, etc
01:04
<Bakkot>
you could write your tokenizer over the specializations, rather than over the full set; that is (afaik) what everyone actually does
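A sketch of what tokenizing "over the specializations" might look like (hypothetical names): the token type gets one case per terminal the syntactic grammar mentions, rather than a single generic Punctuator or IdentifierName.

```scala
sealed trait Token
// Specializations of IdentifierName: each reserved word is its own case.
case object IfTok extends Token
case object ForTok extends Token
final case class IdentTok(name: String) extends Token
// Specializations of Punctuator: each literal is its own case.
case object LParenTok extends Token
case object LessThanTok extends Token // the `<` in RelationalExpression
final case class RegExpTok(body: String, flags: String) extends Token
// ...one case per literal the syntactic grammar uses
```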
01:04
<toddobryan>
Yeah, I can see interpreting it that way.
03:15
<jmdyck>
dang, I missed a grammar discussion
13:17
<bendtherules>
I was looking at NamedEvaluation of anon functions within object literals
13:17
<bendtherules>
and I was like, gotcha - what if this key is a symbol? what will the fn name be?
13:17
<bendtherules>
And then found this - `set name to the string-concatenation of "[", description, and "]"`
13:17
<bendtherules>
Now I understand how much effort and detail goes into the spec. Just wanted to appreciate the contributors.
13:20
<bendtherules>
(and I also wonder if there is a story behind this naming?)
21:21
<jmdyck>
bendtherules: a story behind the name "NamedEvaluation" you mean?
23:28
<devsnek>
bradleymeck: is there a repo for string literal imports