Alternative approach to nested parsing? #2

Closed

opened 2021-11-24 11:48:38 +01:00 by ashtonsix · 2 comments

ashtonsix commented

2021-11-24 11:48:38 +01:00

(Migrated from github.com)

Hi, I noticed this HTML snippet:

<script>var x = '</script>'; var y = 'blah';</script>

Produces this parse tree:

Actual Parse Tree

Document(
  Element(
    OpenTag(
      StartTag("<")
      TagName("script")
      EndTag(">")
    )
    ScriptText("var x = '")
    CloseTag(
      StartCloseTag("</")
      TagName("script")
      EndTag(">")
    )
  )
  Text("'; var y = 'blah';")
  MismatchedCloseTag(
    StartCloseTag("</")
    TagName("script")
    EndTag(">")
  )
)

When I would have instead expected a tree like this:

Expected Parse Tree

Document(
  Element(
    OpenTag(
      StartTag("<")
      TagName("script")
      EndTag(">")
    )
    ScriptText("var x = '</script>'; var y = 'blah';")
    CloseTag(
      StartCloseTag("</")
      TagName("script")
      EndTag(">")
    )
  )
)

What about starting a new inner parse inside tokens.js:contentTokenizer (https://github.com/lezer-parser/html/blob/main/src/tokens.js#L159), and shifting a content token to the outer HTML parser only when the inner parser cannot shift anything and the outer parser can? We would then save the result of the inner parse and mount it once the outer parse completes.

Could this be a good way to approach mixed-language parsing in general?

I have some ideas around context-sensitive languages and modular parsers I think would be neat to explore with this approach, but am not 100% sure they have legs yet.

marijnh commented

2021-11-24 13:45:18 +01:00

(Migrated from github.com)

> When I would have instead expected a tree like this:

Have you tried this in a browser? Because I'm pretty sure the way browsers parse documents like this corresponds to what you are labeling the 'bad' parse tree.
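For readers who want to see why, here is a minimal sketch of the rule browsers follow, assuming a simplified reading of the HTML tokenizer: inside a script element the tokenizer does not look at JavaScript string quoting at all, and the content ends at the first `</script` that is followed by whitespace, `/`, or `>`. The function name `findScriptEnd` is made up for this illustration and is not part of lezer or any browser API, and the sketch ignores the spec's special handling of `<!--` inside scripts.

```javascript
// Hypothetical helper (not a lezer or browser API): return the index of the
// first "</script" end tag at or after `from`, or -1 if there is none.
// Simplification: the "script data escaped" states for "<!--" are ignored.
function findScriptEnd(input, from = 0) {
  // The tag name is matched case-insensitively and must be followed by
  // tab, line feed, form feed, space, "/" or ">".
  const endTag = /<\/script(?=[\t\n\f \/>])/gi;
  endTag.lastIndex = from;
  const match = endTag.exec(input);
  return match ? match.index : -1;
}

const doc = "<script>var x = '</script>'; var y = 'blah';</script>";
const contentStart = "<script>".length;
const contentEnd = findScriptEnd(doc, contentStart);

// The script content stops inside the string literal.
console.log(JSON.stringify(doc.slice(contentStart, contentEnd))); // "var x = '"
```

This matches the `ScriptText("var x = '")` node in the actual parse tree. The usual workaround on the authoring side is to write the string as `'<\/script>'`, which JavaScript treats as the same string but which no longer contains a literal end tag.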

ashtonsix commented

2021-11-24 15:08:44 +01:00

(Migrated from github.com)

Ah, I just tried it in a browser and you're totally right. My apologies for taking your time with this.


Reference

lezer/html#2
