Alternative approach to nested parsing? #2

Closed
opened 2021-11-24 11:48:38 +01:00 by ashtonsix · 2 comments
ashtonsix commented 2021-11-24 11:48:38 +01:00 (Migrated from github.com)

Hi, I noticed this HTML snippet:

```html
<script>var x = '</script>'; var y = 'blah';</script>
```

Produces this parse tree:

Actual Parse Tree
```txt
Document(
  Element(
    OpenTag(
      StartTag("<")
      TagName("script")
      EndTag(">")
    )
    ScriptText("var x = '")
    CloseTag(
      StartCloseTag("</")
      TagName("script")
      EndTag(">")
    )
  )
  Text("'; var y = 'blah';")
  MismatchedCloseTag(
    StartCloseTag("</")
    TagName("script")
    EndTag(">")
  )
)
```

When I would have instead expected a tree like this:

Expected Parse Tree
```txt
Document(
  Element(
    OpenTag(
      StartTag("<")
      TagName("script")
      EndTag(">")
    )
    ScriptText("var x = '</script>'; var y = 'blah';")
    CloseTag(
      StartCloseTag("</")
      TagName("script")
      EndTag(">")
    )
  )
)
```

What about starting a new inner parse inside tokens.js:contentTokenizer (https://github.com/lezer-parser/html/blob/main/src/tokens.js#L159), and then shifting a content token to the outer HTML parser only once the inner parser cannot shift anything AND the outer parser can shift? We would save the result of the inner parse and mount it once the outer parse completes.
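To make the idea concrete, here is a toy sketch of what I mean. It does not use any Lezer API: `findScriptEndStringAware` is a made-up name, and the "inner parser" is reduced to a tracker for JavaScript string literals, used as a rough stand-in for "the inner parser is in the middle of something". The content token ends at the first `</script` seen while that tracker is outside a string.

```javascript
// Toy sketch only: hypothetical helper, not part of @lezer/html.
// The "inner parser" is approximated by tracking JS string literals.
function findScriptEndStringAware(text, start) {
  let quote = null; // the string delimiter currently open, or null
  for (let i = start; i < text.length; i++) {
    const ch = text[i];
    if (quote !== null) {
      if (ch === "\\") i++; // skip the escaped character
      else if (ch === quote) quote = null; // string literal closed
      continue;
    }
    if (ch === "'" || ch === '"' || ch === "`") {
      quote = ch; // string literal opened
      continue;
    }
    // Outside any string: the outer parser could shift a close tag here.
    if (text.slice(i, i + 8).toLowerCase() === "</script") return i;
  }
  return -1; // no close tag found outside a string
}

const doc = "<script>var x = '</script>'; var y = 'blah';</script>";
const contentStart = "<script>".length;
const contentEnd = findScriptEndStringAware(doc, contentStart);
const scriptText = doc.slice(contentStart, contentEnd);
console.log(JSON.stringify(scriptText));
```

On the snippet above this yields the `ScriptText` from the tree I expected. A real version would need a full inner parser rather than a string tracker (comments, regex literals, and template substitutions are all ignored here).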

Could this be a good way to approach mixed-language parsing in general?

I have some ideas around context-sensitive languages and modular parsers I think would be neat to explore with this approach, but am not 100% sure they have legs yet.

marijnh commented 2021-11-24 13:45:18 +01:00 (Migrated from github.com)

> When I would have instead expected a tree like this:

Have you tried this in a browser? Because I'm pretty sure the way browsers parse documents like this corresponds to what you are labeling the 'bad' parse tree.
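As a rough illustration of that behavior: inside a script element the HTML tokenizer does not look at JavaScript syntax at all, so the content ends at the first `</script` that is followed by whitespace, `/`, or `>`. The sketch below is a simplification under that assumption (`findScriptEnd` is a hypothetical helper, not the code in tokens.js, and the spec's special handling of `<!--` inside scripts is left out).

```javascript
// Simplified sketch of browser-style script content scanning.
// Hypothetical helper; ignores the "<!--" escaped states of the HTML spec.
function findScriptEnd(text, start) {
  // First "</script" (case-insensitive) followed by whitespace, "/" or ">".
  const closeTag = /<\/script(?=[\t\n\f\r \/>])/gi;
  closeTag.lastIndex = start; // begin searching at the content start
  const match = closeTag.exec(text);
  return match ? match.index : -1;
}

const doc = "<script>var x = '</script>'; var y = 'blah';</script>";
const contentStart = "<script>".length;
const contentEnd = findScriptEnd(doc, contentStart);

// Script content stops at the first close tag, even inside a JS string.
const scriptText = doc.slice(contentStart, contentEnd);
// Everything after that close tag is parsed as ordinary HTML again.
const rest = doc.slice(doc.indexOf(">", contentEnd) + 1);

console.log(JSON.stringify(scriptText));
console.log(JSON.stringify(rest));
```

Here `scriptText` is `var x = '` and `rest` is `'; var y = 'blah';</script>`, which lines up with the `ScriptText`, `Text`, and `MismatchedCloseTag` nodes in the first tree.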

ashtonsix commented 2021-11-24 15:08:44 +01:00 (Migrated from github.com)

Ah, I just tried it in a browser and you're totally right. My apologies for taking your time with this.
