translate complex words #5

Merged
milahu merged 5 commits from translate-complex-words into master 2022-12-15 13:16:00 +01:00
milahu commented 2022-12-13 17:00:09 +01:00 (Migrated from github.com)

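make the importer work with tree-sitter-bash (https://github.com/tree-sitter/tree-sitter-bash/blob/master/grammar.js), whose word rule is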

    word: $ => token(seq(
      choice(
        noneOf('#', ...SPECIAL_CHARACTERS),
        seq('\\', noneOf('\\s'))
      ),
      repeat(choice(
        noneOf(...SPECIAL_CHARACTERS),
        seq('\\', noneOf('\\s'))
      ))
    )),

the error was

    RangeError: Word token too complex

now word is translated to

    @tokens {
      Word {
        (![#'"<>{}\[\]()`$|&;\\ \t\n\r] | "\\" ![ \t\n\r]) (!['"<>{}\[\]()`$|&;\\ \t\n\r] | "\\" ![ \t\n\r])*
      }
    }
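
For context, the negated character classes in the translated token come from the grammar's noneOf helper. Neither noneOf nor SPECIAL_CHARACTERS is shown in this thread, so both definitions below are assumptions; this is only a minimal sketch of how the first character class of the word rule could end up as a JS regular expression:

```javascript
// Assumed list of characters that cannot appear unescaped in a word.
// Entries are regex-class fragments, so '\\[' is an escaped '[' and '\\s' is whitespace.
const SPECIAL_CHARACTERS = [
  "'", '"', '<', '>', '{', '}', '\\[', '\\]', '(', ')',
  '`', '$', '|', '&', ';', '\\', '\\s',
];

// Assumed helper: build a negated character class from the given fragments.
// A lone backslash is doubled so it stays a literal backslash inside the class.
function noneOf(...characters) {
  const body = characters.map(c => (c === '\\' ? '\\\\' : c)).join('');
  return new RegExp('[^' + body + ']');
}

// First character of a word: anything except '#' and the special characters.
const firstChar = noneOf('#', ...SPECIAL_CHARACTERS);
// Escaped character: a backslash followed by any non-whitespace character.
const escapedChar = new RegExp('\\\\' + noneOf('\\s').source);

console.log(firstChar.source);
console.log(firstChar.test('a'), firstChar.test('#'), firstChar.test(' '));
```

The point of the sketch is only that the tree-sitter side is ordinary JS regex syntax, which the importer then has to rewrite into Lezer token syntax.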
marijnh commented 2022-12-13 17:08:26 +01:00 (Migrated from github.com)

This looks like it'll put Lezer syntax (as returned by translateExpr) into a JS regular expression, which doesn't work.
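
To illustrate the point: Lezer token syntax and JS regular expressions use different notation for negated sets, so a Lezer fragment passed to RegExp is parsed as something else. A small sketch (the fragment is taken from the Word token above; the variable names are only for illustration):

```javascript
// Lezer writes a negated set as ![...]; a JS regex writes it as [^...].
const lezerFragment = '![ \\t\\n\\r]'; // Lezer: any character except whitespace
const asRegExp = new RegExp('^' + lezerFragment + '$');
const intended = /^[^ \t\n\r]$/;

// Parsed as a JS regex, the fragment means:
// a literal "!" followed by exactly one whitespace character.
console.log(asRegExp.test('a'), intended.test('a'));
console.log(asRegExp.test('! '), intended.test('! '));
```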

milahu commented 2022-12-13 17:54:29 +01:00 (Migrated from github.com)

hmm. translateTokenToRegExpr returns the regex

    (?:[^#'"<>{}\[\]()`$|&;\\\s]|\[^\s])(?:(?:[^'"<>{}\[\]()`$|&;\\\s]|\[^\s]))*

but the translator still emits

    @tokens {
      Word {
        (![#'"<>{}\[\]()`$|&;\\ \t\n\r] | "\\" ![ \t\n\r]) (!['"<>{}\[\]()`$|&;\\ \t\n\r] | "\\" ![ \t\n\r])*
      }
    }

... but that should work
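
The difference between the two notations above is mostly mechanical. A hypothetical sketch of the rewrite for a single character class: charClassToLezer is an illustrative name, not a function from the importer, and expanding \s to space, tab, newline and carriage return simply mirrors the output shown above.

```javascript
// Hypothetical helper: rewrite one JS regex character class, given as source
// text such as "[^'\s]", into Lezer token syntax: ![...] if negated, $[...] otherwise.
function charClassToLezer(source) {
  const match = /^\[(\^?)([\s\S]*)\]$/.exec(source);
  if (!match) throw new Error('not a character class: ' + source);
  const [, negated, body] = match;
  let out = '';
  for (let i = 0; i < body.length; i++) {
    if (body[i] === '\\' && i + 1 < body.length) {
      // Consume escapes pairwise so that "\\" followed by "s" is not mistaken for \s.
      const escaped = body[++i];
      out += escaped === 's' ? ' \\t\\n\\r' : '\\' + escaped;
    } else {
      out += body[i];
    }
  }
  return (negated ? '!' : '$') + '[' + out + ']';
}

console.log(charClassToLezer('[^\\s]'));
console.log(charClassToLezer('[a-zA-Z0-9_]'));
```

Walking the escapes pairwise instead of using a global string replace keeps an escaped backslash next to a literal "s" intact.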

now most string literals are translated to keywords (kw<...>)

     UnaryExpression /* precedence: right 0 */ {
    -  ("!" | TestOperator) expression
    +  (kw<"!"> | TestOperator) expression
     }

     PostfixExpression {
    -  expression ("++" | "--")
    +  expression (kw<"++"> | kw<"--">)
     }
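
The diff is consistent with wrapping a string literal in kw<...> whenever the whole literal is matched by the Word token, while literals that contain special characters stay plain. That rule is an assumption inferred from the output, not the importer's code. A sketch, where WORD and emitLiteral are illustrative names and the regex is hand-written from the word rule above:

```javascript
// Hand-written JS equivalent of the word rule: first character must not be '#'
// or a special character; a backslash may escape any non-whitespace character.
const WORD = /^(?:[^#'"<>{}\[\]()`$|&;\\\s]|\\[^\s])(?:[^'"<>{}\[\]()`$|&;\\\s]|\\[^\s])*$/;

// Assumed rule: a literal that Word would also match must be specialized
// from Word (kw<...>), otherwise it can stay a plain literal token.
function emitLiteral(text) {
  const quoted = JSON.stringify(text);
  return WORD.test(text) ? 'kw<' + quoted + '>' : quoted;
}

for (const literal of ['!', '++', 'for', '<', '#', '||']) {
  console.log(literal, '->', emitLiteral(literal));
}
```

Under this reading, operators such as "!" or "++" become keywords only because the bash word token is permissive enough to match them.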
marijnh commented 2022-12-15 11:24:10 +01:00 (Migrated from github.com)

Is this in a form where I can review it?

milahu commented 2022-12-15 12:35:26 +01:00 (Migrated from github.com)

yes. the error "Word token too complex" is gone, and the result looks ok

    curl -L https://github.com/tree-sitter/tree-sitter-bash/raw/master/src/grammar.json >bash-grammar.json
    node dist/import-cli.js bash-grammar.json >bash.grammar

bash.grammar:

@top Program {
  statements |
  ""
}

statements /* precedence: 1 */ {
  (statement ("\n" HeredocBody)? terminator)* statement ("\n" HeredocBody)? terminator?
}

statements2 {
  (statement ("\n" HeredocBody)? terminator)+
}

terminatedStatement {
  statement terminator
}

statement {
  RedirectedStatement |
  VariableAssignment |
  Command |
  DeclarationCommand |
  UnsetCommand |
  TestCommand |
  NegatedCommand |
  ForStatement |
  CStyleForStatement |
  WhileStatement |
  IfStatement |
  CaseStatement |
  Pipeline |
  List |
  Subshell |
  CompoundStatement |
  FunctionDefinition
}

RedirectedStatement /* precedence: -1 */ {
  statement (FileRedirect | HeredocRedirect | HerestringRedirect)+
}

ForStatement {
  (kw<"for"> | kw<"select">) simpleVariableName (kw<"in"> literal+)? terminator DoGroup
}

CStyleForStatement {
  kw<"for"> "((" expression? terminator expression? terminator expression? "))" ";"? (DoGroup | CompoundStatement)
}

WhileStatement {
  (kw<"while"> | kw<"until">) terminatedStatement DoGroup
}

DoGroup {
  kw<"do"> statements2? kw<"done">
}

IfStatement {
  kw<"if"> terminatedStatement kw<"then"> statements2? ElifClause* ElseClause? kw<"fi">
}

ElifClause {
  kw<"elif"> terminatedStatement kw<"then"> statements2?
}

ElseClause {
  kw<"else"> statements2?
}

CaseStatement {
  kw<"case"> literal terminator? kw<"in"> terminator (CaseItem* CaseItem { LastCaseItem })? kw<"esac">
}

CaseItem {
  literal ("|" literal)* ")" statements? /* precedence: 1 */ (";;" | ";&" | ";;&")
}

LastCaseItem {
  literal ("|" literal)* ")" statements? /* precedence: 1 */ (";;")?
}

FunctionDefinition {
  (kw<"function"> Word ("(" ")")? | Word "(" ")") (CompoundStatement | Subshell | TestCommand)
}

CompoundStatement {
  "{" statements2? "}"
}

Subshell {
  "(" statements ")"
}

Pipeline /* precedence: left 1 */ {
  statement ("|" | "|&") statement
}

List /* precedence: left -1 */ {
  statement ("&&" | "||") statement
}

NegatedCommand {
  kw<"!"> (Command | TestCommand | Subshell)
}

TestCommand {
  ("[" expression "]" | "[[" expression "]]" | "((" expression "))")
}

DeclarationCommand /* precedence: left 0 */ {
  (kw<"declare"> | kw<"typeset"> | kw<"export"> | kw<"readonly"> | kw<"local">) (literal | simpleVariableName | VariableAssignment)*
}

UnsetCommand /* precedence: left 0 */ {
  (kw<"unset"> | kw<"unsetenv">) (literal | simpleVariableName)*
}

Command /* precedence: left 0 */ {
  (VariableAssignment | FileRedirect)* CommandName (literal | (kw<"=~"> | kw<"==">) (literal | Regex))*
}

CommandName {
  literal
}

VariableAssignment {
  (VariableName | Subscript) (kw<"="> | kw<"+=">) (literal | Array | emptyValue)
}

Subscript {
  VariableName "[" literal concat? "]" concat?
}

FileRedirect /* precedence: left 0 */ {
  FileDescriptor? ("<" | ">" | ">>" | "&>" | "&>>" | "<&" | ">&" | ">|") literal
}

HeredocRedirect {
  ("<<" | "<<-") HeredocStart
}

HeredocBody {
  simpleHeredocBody |
  heredocBodyBeginning (Expansion | SimpleExpansion | CommandSubstitution | heredocBodyMiddle)* heredocBodyEnd
}

HerestringRedirect {
  "<<<" literal
}

expression {
  literal |
  UnaryExpression |
  TernaryExpression |
  BinaryExpression |
  PostfixExpression |
  ParenthesizedExpression
}

BinaryExpression /* precedence: left 0 */ {
  expression (kw<"="> | kw<"=="> | kw<"=~"> | kw<"!="> | kw<"+"> | kw<"-"> | kw<"+="> | kw<"-="> | "<" | ">" | "<=" | ">=" | "||" | "&&" | TestOperator) expression |
  expression (kw<"=="> | kw<"=~">) Regex
}

TernaryExpression /* precedence: left 0 */ {
  expression kw<"?"> expression kw<":"> expression
}

UnaryExpression /* precedence: right 0 */ {
  (kw<"!"> | TestOperator) expression
}

PostfixExpression {
  expression (kw<"++"> | kw<"--">)
}

ParenthesizedExpression {
  "(" expression ")"
}

literal {
  Concatenation |
  primaryExpression |
  Word { /* precedence: -2 */ (specialCharacter+) }
}

primaryExpression {
  Word |
  String |
  RawString |
  TranslatedString |
  AnsiCString |
  Expansion |
  SimpleExpansion |
  CommandSubstitution |
  ProcessSubstitution
}

Concatenation /* precedence: -1 */ {
  (primaryExpression | specialCharacter) /* precedence: -1 */ (concat (primaryExpression | specialCharacter))+ (concat "$")?
}

String {
  "\"" (("$"? stringContent | Expansion | SimpleExpansion | CommandSubstitution) concat?)* "$"? "\""
}

TranslatedString {
  "$" String
}

Array {
  "(" literal* ")"
}

SimpleExpansion {
  "$" (simpleVariableName | specialVariableName | SpecialVariableName { kw<"!"> } | SpecialVariableName { "#" })
}

StringExpansion {
  "$" String
}

Expansion {
  "${" ("#" | kw<"!">)? (VariableName kw<"="> literal? | (Subscript | simpleVariableName | specialVariableName) (token_4 Regex?)? (literal | kw<":"> | kw<":?"> | kw<"="> | kw<":-"> | kw<"%"> | kw<"-"> | "#")*)? "}"
}

CommandSubstitution {
  "$(" statements ")" |
  "$(" FileRedirect ")" |
  /* precedence: 1 */ ("`" statements "`")
}

ProcessSubstitution {
  ("<(" | ">(") statements ")"
}

simpleVariableName {
  VariableName { token_5 }
}

specialVariableName {
  SpecialVariableName { kw<"*"> | kw<"@"> | kw<"?"> | kw<"-"> | "$" | kw<"0"> | kw<"_"> }
}

kw<term> { @specialize[@name={term}]<Word, term> }

@skip { Comment | token_1 | token_2 | token_3 }

@external tokens token from "./tokens" { HeredocStart, simpleHeredocBody, heredocBodyBeginning, heredocBodyMiddle, heredocBodyEnd, FileDescriptor, emptyValue, concat, VariableName, Regex }

@tokens {
  token_1 {
    $[ \t\r\n]
  }
  token_2 {
    "\\\\" "\\r"? "\\n"
  }
  token_3 {
    "\\\\" (" " | "\\t" | "\\v" | "\\f")
  }
  specialCharacter /* precedence: -1 */ {
    "{" | "}" | "[" | "]"
  }
  stringContent /* precedence: -1 */ {
    (!["`$\\] | "\\\\" (![\n] | "\\r"? "\\n"))+
  }
  RawString {
    "'" ![']* "'"
  }
  AnsiCString {
    "\\$'" (!['] | "\\\\'")* "'"
  }
  token_4 /* precedence: 1 */ {
    "/"
  }
  Comment /* precedence: -10 */ {
    "#" ![\n]*
  }
  token_5 {
    $[a-zA-Z0-9_]+
  }
  Word {
    (![#'"<>{}\[\]()`$|&;\\ \t\n\r] | "\\" ![ \t\n\r]) (!['"<>{}\[\]()`$|&;\\ \t\n\r] | "\\" ![ \t\n\r])*
  }
  TestOperator /* precedence: 1 */ {
    "-" $[a-zA-Z]+
  }
  terminator {
    ";" | ";;" | "\n" | "&"
  }
}

i used the cpp grammar to check for regressions. the only difference in the generated grammar is the kw template, where name became @name

    curl -L https://github.com/tree-sitter/tree-sitter-cpp/raw/master/src/grammar.json >cpp-grammar.json

    # after this PR
    git checkout 9de164871e6dd75d05c9d90baccd16794b151d0b
    npm run build
    node dist/import-cli.js cpp-grammar.json >cpp.grammar.2

    # before this PR
    git checkout 1514bc8487378c7ea1167cfd0b8a31416a7f7509
    npm run build
    node dist/import-cli.js cpp-grammar.json >cpp.grammar.1

    # diff -u cpp.grammar.*
    --- cpp.grammar.1
    +++ cpp.grammar.2
    @@ -1177,7 +1177,7 @@
       NamespaceIdentifier { Identifier }
     }

    -kw<term> { @specialize[name={term}]<Identifier, term> }
    +kw<term> { @specialize[@name={term}]<Identifier, term> }

     @skip { token_1 | Comment }
marijnh commented 2022-12-15 13:24:17 +01:00 (Migrated from github.com)

Thanks. Merged and followed up with b739473

Reference
lezer/import-tree-sitter!5