WIP: import scanner.cc - transpile C++ to javascript #6

Closed
milahu wants to merge 2 commits from import-scanner into master
milahu commented 2022-12-16 13:29:48 +01:00 (Migrated from github.com)

manually translating scanner.cc (https://github.com/tree-sitter/tree-sitter-bash/blob/master/src/scanner.cc) to tokens.js was too boring
so i started a domain-specific C++ to javascript transpiler ...
aka: why write 100 lines, when i can write 1000 lines?

currently this can transpile only the trivial case,
where there is a single "entrypoint",
i.e. only one value in enum TokenType

example: tree-sitter-cpp/src/scanner.cc (https://github.com/tree-sitter/tree-sitter-cpp/blob/master/src/scanner.cc)

enum TokenType {
  RAW_STRING_LITERAL,
};

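generated: scanner.js (https://github.com/milahu/lezer-import-tree-sitter/blob/2c7791ed748320c67e58c6450aa0fe835b7e3b4b/test/cases/tree-sitter-cpp/out/actual/scanner.js)
handwritten: @lezer/cpp/src/tokens.js (https://github.com/lezer-parser/cpp/blob/main/src/tokens.js)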

detail: instead of marker: string i generate delimiter: number[]
maybe my version is 1% faster, because i avoid marker.charCodeAt(i)
but then sizeof(number) > sizeof(char) so ... gotta benchmark
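
A minimal sketch of the two representations compared above, for illustration only. The function names (`matchesMarker`, `toDelimiter`, `matchesDelimiter`) are hypothetical and are not taken from scanner.js or tokens.js; whether either variant is measurably faster is left open, as the comment above says.

```javascript
// Hypothetical helpers, not copied from scanner.js or tokens.js.

// Variant A: keep the delimiter as a string and read a code unit
// from it on every comparison (marker.charCodeAt(i)).
function matchesMarker(input, pos, marker) {
  for (let i = 0; i < marker.length; i++) {
    if (input.charCodeAt(pos + i) !== marker.charCodeAt(i)) return false;
  }
  return true;
}

// Variant B: convert the delimiter to an array of code units once ...
function toDelimiter(marker) {
  const delimiter = [];
  for (let i = 0; i < marker.length; i++) {
    delimiter.push(marker.charCodeAt(i));
  }
  return delimiter;
}

// ... and compare plain numbers afterwards.
// Reading past the end of `input` yields NaN, which never equals a code unit.
function matchesDelimiter(input, pos, delimiter) {
  for (let i = 0; i < delimiter.length; i++) {
    if (input.charCodeAt(pos + i) !== delimiter[i]) return false;
  }
  return true;
}

// Both variants should agree on every input.
const sample = ')abc"';
console.log(matchesMarker(sample, 1, "abc"));
console.log(matchesDelimiter(sample, 1, toDelimiter("abc")));
```

Both variants do the same number of comparisons, so any difference would come from the per-character `charCodeAt` call versus the array read, and from the memory layout of the array. That is an engine detail, hence the need to benchmark.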

next steps

so currently i'm trying to solve the complex case,
where the Scanner.scan function acts as a "router" for multiple TokenType values

this will probably involve tree-shaking and function inlining ...

enum TokenType {
  HEREDOC_START,
  HEREDOC_BODY_BEGINNING,
  HEREDOC_BODY_MIDDLE,
  HEREDOC_BODY_END,
  CONCAT,
  // ...
};

struct Scanner {
  bool scan(TSLexer *lexer, const bool *valid_symbols) {
    if (valid_symbols[CONCAT]) {
      return scan_concat(lexer);
    }
    if (valid_symbols[HEREDOC_START]) {
      return scan_heredoc_start(lexer);
    }
    if (valid_symbols[HEREDOC_BODY_BEGINNING] && !heredoc_delimiter.empty() && !started_heredoc) {
      return scan_heredoc_content(lexer, HEREDOC_BODY_BEGINNING, SIMPLE_HEREDOC_BODY);
    }
    if (valid_symbols[HEREDOC_BODY_MIDDLE] && !heredoc_delimiter.empty() && started_heredoc) {
      return scan_heredoc_content(lexer, HEREDOC_BODY_MIDDLE, HEREDOC_BODY_END);
    }
    // ...
  }
};
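
One possible shape for such a router on the JavaScript side is sketched below, as an ordered table of guard/scan pairs. This is an assumption for illustration, not the tree-shaking and inlining approach mentioned above, and not what the transpiler generates. The term ids, the `state` object, the `valid` predicate (a stand-in for `valid_symbols[...]`) and the stub scan functions are all hypothetical.

```javascript
// Hypothetical term ids; in a real grammar these would come from generated code.
const Term = {
  CONCAT: 1,
  HEREDOC_START: 2,
  HEREDOC_BODY_BEGINNING: 3,
  HEREDOC_BODY_MIDDLE: 4,
};

// Ordered guard/scan pairs, in the same order as the if-chain in Scanner.scan.
// `scanners` is a hypothetical object holding the ported scan_* functions.
function createRoutes(scanners) {
  return [
    { guard: (valid) => valid(Term.CONCAT), scan: scanners.concat },
    { guard: (valid) => valid(Term.HEREDOC_START), scan: scanners.heredocStart },
    {
      guard: (valid, state) =>
        valid(Term.HEREDOC_BODY_BEGINNING) &&
        state.heredocDelimiter.length > 0 &&
        !state.startedHeredoc,
      scan: scanners.heredocBodyBeginning,
    },
    {
      guard: (valid, state) =>
        valid(Term.HEREDOC_BODY_MIDDLE) &&
        state.heredocDelimiter.length > 0 &&
        state.startedHeredoc,
      scan: scanners.heredocBodyMiddle,
    },
  ];
}

// Like the C++ router, the first route whose guard holds decides the result.
// Returning false when nothing matches is an assumption; the original
// elides the remaining branches with "// ...".
function scan(input, valid, state, routes) {
  for (const route of routes) {
    if (route.guard(valid, state)) return route.scan(input, state);
  }
  return false;
}

// Stub scanners that only record which branch ran.
const calls = [];
const routes = createRoutes({
  concat: () => (calls.push("concat"), true),
  heredocStart: () => (calls.push("heredocStart"), true),
  heredocBodyBeginning: () => (calls.push("heredocBodyBeginning"), true),
  heredocBodyMiddle: () => (calls.push("heredocBodyMiddle"), true),
});

// Inside a heredoc that has already started, only the "middle" route applies.
const state = { heredocDelimiter: "EOF", startedHeredoc: true };
const validSet = new Set([Term.HEREDOC_BODY_BEGINNING, Term.HEREDOC_BODY_MIDDLE]);
const matched = scan(null, (term) => validSet.has(term), state, routes);
console.log(matched, calls);
```

Keeping the guards as data preserves the branch order of the C++ code, which matters because the first matching branch wins. Splitting the table into one tokenizer per token type would only be safe if the guards are mutually exclusive.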

my code is far from "ready to merge" ... lots of dead code, debug stuff, comments

marijnh commented 2022-12-16 14:29:31 +01:00 (Migrated from github.com)

This is not something that I'm interested in maintaining, so I don't think it should live in this repository.

milahu commented 2022-12-16 14:56:21 +01:00 (Migrated from github.com)
understandable. moving to https://github.com/milahu/lezer-parser-import-tree-sitter-scanner

Pull request closed

Reference
lezer/import-tree-sitter!6