00:00 | <devsnek> | maybe the `of` in a for/of statement |
00:00 | <toddobryan> | Even though Punctuator is defined in the lexical grammar, it's not used in the syntactic grammar. Instead you have things like RelationalOperator or EqualityOperator. |
00:00 | <devsnek> | ah you mean when it does bold text |
00:01 | <devsnek> | generally those occurrences are described by the lexical grammar, even when it doesn't specifically mention the lexical production in question |
00:01 | <toddobryan> | Exactly. |
00:02 | <toddobryan> | Really, they're not. Like I said, the lexical grammar includes Punctuator. Nowhere is that referred to in the syntactic grammar. |
00:03 | <devsnek> | toddobryan: if it helps, think of them like c enums |
00:04 | <devsnek> | everything used in the syntactic grammar should be well defined in the lexical grammar, even if it doesn't refer to it directly |
00:05 | <toddobryan> | And I lied above. RelationalOperator and EqualityOperator don't exist. I created them when I was writing my grammar. In the lexical grammar, they're not distinguishable (all under Punctuator) and in the syntactic grammar, they're literals: https://www.ecma-international.org/ecma-262/#prod-RelationalExpression and https://www.ecma-international.org/ecma-262/#prod-EqualityExpression. |
00:08 | <devsnek> | toddobryan: I'm not sure I understand what the problem is 😅 |
00:08 | <toddobryan> | devsnek: I'd be fine with that, but nowhere in the spec does it specify how/when you should choose between the goal symbols of InputElementDiv, InputElementRegExp, etc. |
00:09 | <devsnek> | it shouldn't ever have to specify how you choose |
00:09 | <devsnek> | there are maybe two places where it does (if statements and some regex thing in annex b) |
00:09 | <devsnek> | there's a pr open to fix the if statement one |
00:09 | <Bakkot> | It does specify: "The InputElementRegExpOrTemplateTail goal is used in syntactic grammar contexts where a RegularExpressionLiteral, a TemplateMiddle, or a TemplateTail is permitted. The InputElementRegExp goal symbol is used in all syntactic grammar contexts where a RegularExpressionLiteral is permitted but neither a TemplateMiddle, nor a TemplateTail is permitted. The InputElementTemplateTail goal is used in all syntactic grammar contexts where a TemplateMiddle or a TemplateTail is permitted but a RegularExpressionLiteral is not permitted. In all other contexts, InputElementDiv is used as the lexical goal symbol." |
00:09 | <toddobryan> | OK--here's the basic problem. If the left-hand side of a grammar production never appears in the right-hand side of another rule, that production is unused and provides no information to the spec. |
00:10 | <toddobryan> | Find me a place where InputElementDiv is used in the syntactic grammar. |
00:12 | <Bakkot> | The information it provides is in how to divide up the source text. |
00:13 | <devsnek> | I think it's described in section 11? |
00:14 | <toddobryan> | But it's not actually used. In the syntactic grammar, the input is referred to explicitly, not the productions that the lexical grammar provides. |
00:16 | <toddobryan> | And since the syntactic grammar does not explicitly use those rules, it's almost impossible to write a tokenizer. (Or at least it has been for me.) |
00:17 | <Bakkot> | The syntactic grammar is not defined over the source text. If it were it would have to deal with whitespace and comments. |
00:18 | <Bakkot> | The point of splitting them is so that the lexical grammar can deal with whitespace and comments, including stuff like "is `/ a /` three tokens or one" |
00:19 | <rkirsling> | to quote the page I linked |
00:19 | <rkirsling> | > In implementations, the syntactic grammar analyzer (“parser”) may call the lexical grammar analyzer (“tokenizer” or “lexer”), passing the goal symbol as a parameter and asking for the next input element suitable for that goal symbol. |
00:20 | <toddobryan> | OK. Fair enough. But how would I know when to use the `InputElementRegExp` production to get the next token? |
00:21 | <toddobryan> | See note 1 here: https://www.ecma-international.org/ecma-262/#prod-GeneratorMethod |
00:21 | <toddobryan> | There is no such note anywhere that references InputElementRegExp. (That I can find. I'd be happy to learn I'm missing something.) |
00:22 | <devsnek> | in terms of concrete implementation you don't usually see any specific "now we're lexing the input" type stuff |
00:22 | <devsnek> | aside from handling hashbang |
00:23 | <Bakkot> | The bit I quoted above says you'd use `InputElementRegExp` to get the next input element "in all syntactic grammar contexts where a RegularExpressionLiteral is permitted but neither a TemplateMiddle, nor a TemplateTail is permitted" |
00:24 | <devsnek> | all the implementations I know of use a whitelist of allowed tokens preceding the yield |
00:24 | <devsnek> | there aren't that many |
00:24 | <Bakkot> | *following? |
00:24 | <Bakkot> | rather than preceding |
00:24 | <devsnek> | yeah that lol |
00:26 | <toddobryan> | So, you'd use `InputElementRegExp` when applying `PrimaryExpression`? (I did a quick search, and I think that's the only rule that `RegularExpressionLiteral` appears on the right of.) |
00:27 | <Bakkot> | mmm.... "when applying" doesn't exactly make sense, I think |
00:27 | <Bakkot> | if you've just parsed _part_ of an expression, like `a +`, then you'd use `InputElementRegExp` |
00:27 | <toddobryan> | When trying to satisfy? |
00:28 | <Bakkot> | The spec is kind of written on the assumption that you'll be using a bottom-up parser, probably specifically a shift-reduce parser |
00:28 | <Bakkot> | which doesn't really have a notion of "trying to satisfy" |
00:29 | <Bakkot> | trying to satisfy is more of a top-down thing |
00:29 | <toddobryan> | How can you do a bottom-up parse when what's legal as a token depends on the context you're in? |
00:30 | <Bakkot> | Because you don't have to know exactly which context you're in |
00:30 | <toddobryan> | I'm writing a recursive-descent parser, so I do know which rule I'm trying to apply... |
00:30 | <Bakkot> | Like I said, if you've just parsed `a +`, you know that the following token can be a RegularExpressionLiteral but not a TemplateMiddle or a TemplateTail, so you know to use InputElementRegExp |
00:31 | <Bakkot> | if you've just parsed `yield`, and you're in a template interpolation, you know that the next token could be a RegularExpressionLiteral or a TemplateTail, so you'd use InputElementRegExpOrTemplateTail |
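[editor's note: Bakkot's second example can be made concrete. In this hedged sketch, the token after `yield` inside a template interpolation happens to be a regexp literal, but a template tail would also have been legal at that point, which is exactly the InputElementRegExpOrTemplateTail context; the generator name is illustrative.]

```javascript
// Inside the `${ ... }` interpolation, the token after `yield` could be a
// RegularExpressionLiteral (as here) or a TemplateTail closing the template,
// so the InputElementRegExpOrTemplateTail goal symbol applies at that point.
function* g() {
  return `x${yield /a/}y`;
}

const it = g();
const first = it.next();       // runs to the yield; yields the regexp literal /a/
const second = it.next("mid"); // resumes; the template evaluates to "xmidy"
```

Stepping through: the first `next()` produces the regexp as the yielded value, and the second `next("mid")` substitutes `"mid"` into the template and finishes the generator.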
00:31 | <Bakkot> | etc |
00:31 | <Bakkot> | ok, so, backing up a bit |
00:32 | <Bakkot> | Are you specifically interested in having your parser cleave as close as possible to the spec, or are you just trying to write a parser? |
00:32 | <toddobryan> | Is there a way to know, only from the previous tokens, which goal symbol I should be using? |
00:33 | <toddobryan> | def importCall(_yield: Boolean, _await: Boolean) = str("import") ~ elem('(') ~ assignmentExpression(true, _yield, _await) ~ elem(')') |
00:33 | <toddobryan> | There's a sample rule in the parser I've written so far, so I'm sticking pretty close to the spec. :-) |
00:34 | <devsnek> | what language is that |
00:34 | <toddobryan> | Scala. |
00:34 | <Bakkot> | parser combinators, woo |
00:34 | <Bakkot> | Anyway, the answer to your previous question is yes |
00:35 | <Bakkot> | You know which syntactic contexts you might be in, which means you know if the next token can be a regexp, a template tail, or neither, which means you know which of the goal symbols to use |
00:36 | <Bakkot> | both? InputElementRegExpOrTemplateTail. just regexp? InputElementRegExp. just template middle/tail? InputElementTemplateTail. neither? InputElementDiv. |
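[editor's note: the four-way choice Bakkot lays out can be sketched as a helper the parser calls before requesting each token. The function and parameter names are hypothetical, not from the spec; only the four goal-symbol names are the spec's own.]

```javascript
// Hypothetical helper: the parser passes in what its current syntactic
// context permits, and gets back the lexical goal symbol to hand to the
// tokenizer when asking for the next input element.
function lexicalGoal({ regExpAllowed, templateTailAllowed }) {
  if (regExpAllowed && templateTailAllowed) return "InputElementRegExpOrTemplateTail";
  if (regExpAllowed) return "InputElementRegExp";
  if (templateTailAllowed) return "InputElementTemplateTail";
  return "InputElementDiv";
}
```

For example, just after parsing `a +` (an expression is expected, no open template), `lexicalGoal({ regExpAllowed: true, templateTailAllowed: false })` gives `"InputElementRegExp"`.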
00:36 | <devsnek> | you might want to take a look at some existing parsers (acorn, babel, shiftjs) and see how they work |
00:37 | <toddobryan> | OK. I know about TemplateTail, because if I don't have a previous, unclosed TemplateHead, that's out. |
00:37 | <Bakkot> | devsnek ehh, so, the reason I was asking about if toddobryan was interested in sticking to the spec or not was, the spec is written for clarity and precision, not for ease of implementation |
00:37 | <Bakkot> | so all the parser implementations don't look very much like the spec |
00:38 | <Bakkot> | if you're trying to look like the spec you have to use different implementation strategies |
00:38 | <toddobryan> | Is there something as easy as that for whether a Regexp is legal? |
00:39 | <toddobryan> | I'm guessing I'll need a couple of flags during tokenizing that just flip on or off `isRegexpAllowed` and `isTemplateTailAllowed`. |
00:40 | <devsnek> | for regex literals you generally run over them with some very light rule that just basically recognizes / [ and ] |
00:40 | <Bakkot> | mm, not quite as easy as that, I don't think |
00:40 | <devsnek> | and then pass it to the separate regex parser |
00:40 | <Bakkot> | devsnek that happens later, not when trying to decide which lexical goal symbol to use |
00:40 | <Bakkot> | it's a separate concern |
00:40 | <devsnek> | oh they meant if a regex is allowed |
00:40 | <devsnek> | not if it's valid |
00:41 | <Bakkot> | toddobryan basically though a regexp is legal wherever an expression is legal; if the next token can't be an expression (without an intervening semicolon), then you can't have a regexp |
00:41 | <Bakkot> | so, e.g., if you have just finished parsing an expression (other than `yield`), you can't have a regexp |
00:42 | <Bakkot> | btw if you haven't thought about ASI now is the time at which you'll need to think about ASI |
00:42 | <toddobryan> | ASI? |
00:42 | <devsnek> | also no line terminator here |
00:42 | <Bakkot> | automatic semicolon insertion |
00:42 | <Bakkot> | https://tc39.es/ecma262/#sec-automatic-semicolon-insertion |
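[editor's note: a minimal example of the ASI behavior Bakkot is pointing at, using the classic `return` restriction.]

```javascript
// ASI in action: `return` is a "restricted production", so a line
// terminator after it triggers automatic semicolon insertion. This
// parses as `return; 42;` and the function returns undefined, not 42.
function returnsUndefined() {
  return
  42;
}
```

The `42;` becomes an unreachable expression statement after the inserted semicolon.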
00:43 | <toddobryan> | Yeah. That's next on the list. |
00:44 | <toddobryan> | OK. So I guess I need to figure out how to keep track of whether a RegularExpressionLiteral is allowed. |
00:46 | <Bakkot> | I think you always know at any point you'd ask the tokenizer for the next token |
00:46 | <Bakkot> | like the normal way of writing a recursive descent parser, you have, I don't know, parseConditionalExpression or whatever |
00:46 | <toddobryan> | Thanks for humoring me. I now understand what I was missing. |
00:48 | <Bakkot> | and you call parseBinaryExpression and so on, and eventually end up at parsePrimaryExpression, and you're looking at the next token to determine which kind of primary expression it is |
00:48 | <toddobryan> | Well, I was hoping to separate the parser and the tokenizer, but couldn't figure out how to do that without understanding which rules were applicable. |
00:48 | <Bakkot> | and at that point you know that a regexp is legal, so when you ask for the next token, you know to ask for the regexp ones |
00:48 | <Bakkot> | ah, yeah, you can't split them out because you don't know which goal symbol to use without knowing the syntactic context, unfortunately |
00:49 | <Bakkot> | specifically, you don't know if `/` (or `/=`) is going to be the beginning of a regexp or a division without knowing the syntactic context |
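[editor's note: the `/` ambiguity in one small example; variable names are arbitrary, chosen so the same characters appear in both roles.]

```javascript
// The same `/` character is division or the start of a regexp literal
// depending on the syntactic context: after an identifier, it is the
// division Punctuator; where an expression is expected (after `=`),
// it begins a RegularExpressionLiteral.
const a = 8, b = 4, g = 2;
const divisions = a / b / g; // two divisions: (8 / 4) / 2
const regexp = /b/g;         // one token: regexp /b/ with the `g` flag
```

Only the syntactic context distinguishes the three-token `/ b / g` from the single-token `/b/g`, which is why the tokenizer needs the parser to pick the goal symbol.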
00:50 | <Bakkot> | anyway, good luck! if you make progress and are inclined to share you should post it; I would enjoy reading it |
00:51 | <Bakkot> | I maintain a recursive descent JS parser written in Java, which... is a bit of a pain |
00:51 | <toddobryan> | That makes perfect sense. So if I did want to write just a tokenizer, I'd need to include enough syntactic context to disambiguate. |
00:53 | <toddobryan> | I can imagine. I tried creating a parser combinator library in Kotlin and gave up. The thing that makes them so nice in Scala is that you have the flexible syntax, but, and this is key, you can pass arguments by name in addition to by reference, so you don't have to deal with things like circular references and such. |
00:53 | <toddobryan> | Thanks for all the help! |
00:54 | <Bakkot> | Yeah, the Java one does not get to use combinators |
00:54 | <Bakkot> | Well, it could I guess, but it doesn't |
00:55 | <Bakkot> | (it lives in https://github.com/shapesecurity/shift-java/blob/es2018/src/main/java/com/shapesecurity/shift/es2018/parser/GenericParser.java ) |
00:55 | <toddobryan> | Thanks! |
01:00 | <Bakkot> | toddobryan: thinking about it more, for a recursive descent parser I don't think you'd need to actually "keep track" of any state, as such, to know which of the four lexical grammar goal symbols to ask for |
01:00 | <Bakkot> | it's always going to be obvious every time you ask for a token |
01:01 | <toddobryan> | Yeah. It seems to work for parsing without knowing--what I can't do is tokenize. |
01:01 | <Bakkot> | Can't tokenize ahead of parsing, right |
01:02 | <toddobryan> | Because the syntactic grammar rules aren't written in terms of tokens--they're written in terms of input literals in lots of cases. |
01:03 | <Bakkot> | Hm. I guess the way I'd put it is, the input literals are specializations of the nonterminals of the lexical grammar |
01:03 | <Bakkot> | when the syntactic grammar says `if`, for example, that is an IdentifierName which has the contents "if" |
01:04 | <Bakkot> | i.e. it is a particular kind of IdentifierName |
01:04 | <Bakkot> | similarly when it says `(` that is a particular kind of Punctuator, etc |
01:04 | <Bakkot> | you could write your tokenizer over the specializations, rather than over the full set; that is (afaik) what everyone actually does |
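[editor's note: a rough sketch of tokenizing over the "specializations" Bakkot describes, where `if` gets its own token type instead of being a generic IdentifierName. The keyword list is truncated and the token shapes are illustrative, not from the spec.]

```javascript
// Sketch: classify a lexeme into a specialized token type. Keywords and
// punctuators each become their own kind, which is what real tokenizers
// do, rather than emitting one big IdentifierName / Punctuator bucket.
const KEYWORDS = new Set(["if", "else", "return", "yield", "function"]);

function classify(lexeme) {
  if (KEYWORDS.has(lexeme)) return { type: lexeme, value: lexeme };
  if (/^[A-Za-z_$][A-Za-z0-9_$]*$/.test(lexeme)) {
    return { type: "Identifier", value: lexeme };
  }
  return { type: "Punctuator", value: lexeme };
}
```

With this shape, the parser can match the syntactic grammar's literal `if` directly against a token of type `"if"`.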
01:04 | <toddobryan> | Yeah, I can see interpreting it that way. |
03:15 | <jmdyck> | dang, I missed a grammar discussion |
13:17 | <bendtherules> | I was looking at NamedEvaluation of anon functions within object literals |
13:17 | <bendtherules> | and I was like gotcha: what if this key is a symbol? what will the fn name be? |
13:17 | <bendtherules> | And then found this - `set name to the string-concatenation of "[", description, and "]"` |
13:17 | <bendtherules> | Now I understand how much effort and detail goes into the spec. Just wanted to appreciate the contributors. |
13:20 | <bendtherules> | (and i also wonder if there is a story behind this naming?) |
21:21 | <jmdyck> | bendtherules: a story behind the name "NamedEvaluation" you mean? |
23:28 | <devsnek> | bradleymeck: is there a repo for string literal imports |