#13 - dialect selfClosing is not working - parse error at SelfClosingEndTag - lezer/html

milahu commented

2024-02-21 09:44:27 +01:00

(Migrated from github.com)

for a semantic stage using this parser
it is useful to know the difference between `">"` and `"/>"`

by default, both `">"` and `"/>"` are parsed as `EndTag`
so currently, i need some extra if/then/else logic
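
The extra logic could look roughly like the following. This is a minimal sketch, assuming the token's offsets and the full source string are available; `classifyEndTag` is a hypothetical helper name, not part of `@lezer/html`.

```js
// Hypothetical helper (not part of @lezer/html):
// tell ">" and "/>" apart when both arrive as EndTag.
// `source` is the full input, `from` and `to` are the token's offsets.
function classifyEndTag(source, from, to) {
  const text = source.slice(from, to);
  if (text.endsWith("/>")) return "SelfClosingEndTag";
  if (text.endsWith(">")) return "EndTag";
  return "Unknown";
}

const source = "<img><br/>";
console.log(classifyEndTag(source, 4, 5));  // EndTag
console.log(classifyEndTag(source, 8, 10)); // SelfClosingEndTag
```

The offsets 4 and 8 are the positions of the two `EndTag` tokens for the input `<img><br/>`.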

i tried to parse `"/>"` as `SelfClosingEndTag`
by enabling the `selfClosing` dialect
but this gives a parse error at `"/>"`

input: `<img><br/>`

lezer-parser-html with default config
`">"` and `"/>"` produce node 4

```
0: node 10 = StartTag: "<"
1: node 22 = TagName: "img"
4: node 4 = EndTag: ">"
5: node 10 = StartTag: "<"
6: node 22 = TagName: "br"
8: node 4 = EndTag: "/>"
```

lezer-parser-html with `.configure({ dialect: "selfClosing" })`
`"/>"` gives a parse error

```
0: node 10 = StartTag: "<"
1: node 22 = TagName: "img"
4: node 4 = EndTag: ">"
5: node 10 = StartTag: "<"
6: node 22 = TagName: "br"
8: node 0 = ⚠: ""
8: node 16 = Text: "/>"
```

what would `tree-sitter-html` do?
`">"` and `"/>"` produce different nodes by default: node 3 and node 6

```
0: node 5 = <: "<"
1: node 17 = tag_name: "img"
4: node 3 = >: ">"
5: node 5 = <: "<"
6: node 17 = tag_name: "br"
8: node 6 = />: "/>"
```

https://github.com/lezer-parser/html/blob/fa8c9d581062bbf9d9d018637657a196d4e0cf0e/src/html.grammar#L108-L109

> for a semantic stage using this parser

im using a custom tree walker that returns a sequence of tokens
so when i concat all these tokens, i get the original source text
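
The round-trip property described here can be sketched on its own, without a parser. This is a minimal sketch: `sliceTokens` and `exampleTokens` are hypothetical names, and the token offsets are written by hand for the input `<img><br/>`. Each token's text is sliced from the end of the previous token, so any source text skipped by the walker is attached to the next token instead of being lost.

```js
// Hypothetical helper: turn a sequence of tokens into source slices.
// Each slice starts where the previous token ended.
function sliceTokens(source, tokens) {
  let lastTo = 0;
  const parts = [];
  for (const token of tokens) {
    parts.push(source.slice(lastTo, token.to));
    lastTo = token.to;
  }
  // keep any trailing text after the last token
  if (lastTo < source.length) parts.push(source.slice(lastTo));
  return parts;
}

// hand-written end offsets for the input below (assumption, not parser output)
const exampleSource = "<img><br/>";
const exampleTokens = [
  { name: "StartTag", to: 1 },
  { name: "TagName", to: 4 },
  { name: "EndTag", to: 5 },
  { name: "StartTag", to: 6 },
  { name: "TagName", to: 8 },
  { name: "EndTag", to: 10 },
];

const parts = sliceTokens(exampleSource, exampleTokens);
console.log(parts.join("") === exampleSource); // true
```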

lezer-parser-html

```js
// https://codereview.stackexchange.com/a/97886/205605
// based on nix-eval-js/src/lezer-parser-nix/src/nix-format.js
import { parser as lezerParserHtml } from '@lezer/html';

// node types that are not passed to the callback (mostly container nodes)
const ignoredTypeIds = new Set([
  15, // Document
  20, // Element
  23, // Attribute
  21, // OpenTag <script>
  30, // OpenTag <style>
  36, // OpenTag
  32, // CloseTag </style>
  29, // CloseTag </script>
  37, // CloseTag
  38, // SelfClosingTag
  // note: this is inconsistent in the parser
  // InvalidEntity is child node
  // EntityReference is separate node (sibling of other text nodes)
  19, // InvalidEntity: <a href="?a=1&b=2" -> "&" is parsed as InvalidEntity
  //17, // EntityReference: "&amp;" or "&mdash;" is parsed as EntityReference
]);

/** @param {Tree | TreeNode} tree */
function walkHtmlTree(tree, func) {

  const cursor = tree.cursor();
  if (!cursor) return;

  let depth = 0;

  while (true) {
    // NLR: Node, Left, Right
    // Node
    if (!ignoredTypeIds.has(cursor.type.id)) {
      func(cursor);
    }

    // Left
    if (cursor.firstChild()) {
      // moved down
      depth++;
      continue;
    }
    // Right
    if (depth > 0 && cursor.nextSibling()) {
      // moved right
      continue;
    }
    let continueMainLoop = false;
    while (cursor.parent()) {
      // moved up
      depth--;
      if (depth <= 0) {
        // when tree is a node, stop at the end of node
        // == dont visit sibling or parent nodes
        return;
      }
      if (cursor.nextSibling()) {
        // moved up + right
        continueMainLoop = true;
        break;
      }
    }
    if (continueMainLoop) continue;
    break;
  }
}

const htmlParser = lezerParserHtml.configure({
  //dialect: "selfClosing",
});

const inputHtml = `<img><br/>`;

const htmlTree = htmlParser.parse(inputHtml);

const topNode = htmlTree.topNode;

// print every token with its source text, sliced from the end of the previous token
let lastNodeTo = 0;
walkHtmlTree(topNode, (node) => {
  const nodeSource = inputHtml.slice(lastNodeTo, node.to);
  lastNodeTo = node.to;
  console.log(`${node.from}: node ${node.type.id} = ${node.type.name}: ${JSON.stringify(nodeSource)}`);
});
```

tree-sitter-html

```py
# https://github.com/tree-sitter/py-tree-sitter/issues/33
import json

import tree_sitter_languages

def walk_html_tree(tree, func):
    # node kinds that are not passed to the callback (container nodes)
    ignore_kind_id = [
        25, # fragment
        26, # doctype
        28, # element
        29, # script_element
        30, # style_element
        31, # start_tag
        34, # self_closing_tag
        35, # end_tag
        37, # attribute
        38, # quoted_attribute_value
    ]
    cursor = tree.walk()
    reached_root = False
    while not reached_root:
        if cursor.node.kind_id not in ignore_kind_id:
            func(cursor.node)
        if cursor.goto_first_child():
            continue
        if cursor.goto_next_sibling():
            continue
        # no child and no sibling: go up until a next sibling exists
        retracing = True
        while retracing:
            if not cursor.goto_parent():
                retracing = False
                reached_root = True
            if cursor.goto_next_sibling():
                retracing = False

last_node_to = 0

input_html = """<img><br/>"""

def walk_callback(node):
    # last_node_to is a module-level variable, so it needs "global", not "nonlocal"
    global last_node_to
    s = json.dumps(node.text.decode("utf8"))
    print(f"{node.range.start_byte}: node {node.kind_id} = {node.type}: {s}")
    #node_source = input_html[last_node_to:node.range.end_byte]
    last_node_to = node.range.end_byte

html_parser = tree_sitter_languages.get_parser("html")

# the parser expects bytes, not str
html_tree = html_parser.parse(input_html.encode("utf8"))
top_node = html_tree.root_node

walk_html_tree(top_node, walk_callback)
```


marijnh commented

2024-02-21 11:38:04 +01:00

(Migrated from github.com)

Attached patch should help.

👍 1

milahu commented

2024-03-03 11:10:12 +01:00

(Migrated from github.com)

thanks, now `"/>"` is parsed as `SelfClosingEndTag`

stupid question: why is the `selfClosing` dialect not the default behavior?

html can contain arbitrary xml nodes like `<custom/>`
where i cannot use the node name to detect self-closing nodes

> for a semantic stage using this parser it is useful to know the difference between `">"` and `"/>"`


marijnh commented

2024-03-03 11:53:27 +01:00

(Migrated from github.com)

> html can contain arbitrary xml nodes like `<custom/>`

HTML ignores the `/` in that syntax and does _not_ treat this as a self-closing tag. So making the parser treat it as if it works by default would be confusing to people.
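
For a semantic stage, this rule could be sketched as follows. It is a minimal sketch: the list of void elements is the one given in the HTML standard, `htmlTreatsAsClosed` is a hypothetical helper name, and foreign content (SVG, MathML), where `/>` does close the element, is not handled.

```js
// void elements as listed in the HTML standard
const voidElements = new Set([
  "area", "base", "br", "col", "embed", "hr", "img",
  "input", "link", "meta", "source", "track", "wbr",
]);

// Hypothetical helper: does HTML treat this start tag as a complete element?
// Only the tag name matters; a trailing "/" has no effect in HTML.
function htmlTreatsAsClosed(tagName) {
  return voidElements.has(tagName.toLowerCase());
}

console.log(htmlTreatsAsClosed("br"));     // true:  <br> and <br/> are both complete
console.log(htmlTreatsAsClosed("custom")); // false: <custom/> stays open
```

With the `selfClosing` dialect enabled, the parser reports `/>` as `SelfClosingEndTag`, but a stage that wants browser-like HTML semantics would still decide by tag name, as above.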


milahu commented

2024-03-03 17:07:42 +01:00

(Migrated from github.com)

aah, because HTML has its roots in SGML, not XML

so the spec defines [void elements](https://html.spec.whatwg.org/multipage/syntax.html#void-elements) which can end with `>` or `/>`
but actually `/>` is XML syntax

```html
<style>custom { color: red; }</style>
<div>
  <p>aaa</p>
  <custom/>
    <p>bbb</p> <!-- this is red -->
</div>
<p>ccc</p>
```

related: [Are (non-void) self-closing tags valid in HTML5?](https://stackoverflow.com/questions/3558119/are-non-void-self-closing-tags-valid-in-html5)
