Fix reducePos for zero-length contextual tokens #65

Closed
martijnwalraven wants to merge 1 commit from fix-zero-length-reducepos into main
martijnwalraven commented 2026-01-01 05:38:47 +01:00 (Migrated from github.com)

Summary

Zero-length tokens produced by contextual tokenizers should still advance reducePos. When end == this.pos, pos doesn’t change, but reducePos must so that repeat reductions compute correct sizes.

This patch updates reducePos for non-skipped tokens regardless of whether end > this.pos, fixing negative-size repeat nodes and buffer reordering.

Root cause

Stack.shift only updated reducePos inside the end > this.pos branch. For zero-length tokens, pos stays the same and reducePos is left behind. When a +/* repeat reduces, it computes size = reducePos - start, which becomes negative. storeNode then sees a node with end < start and can move it before skipped tokens, leading to TreeBuffer underflow (e.g. comment nodes ending up at 65536).
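
To make the arithmetic concrete, here is a minimal sketch of the bookkeeping described above. It is a simplified model, not the actual `@lezer/lr` `Stack` class: `ModelStack`, `skip`, `reduceSize` and `repeatSize` are hypothetical names, the offsets are made up, and it assumes that skipped tokens advance `pos` without touching `reducePos`.

```typescript
// Simplified model of the position bookkeeping; NOT the real @lezer/lr Stack.
class ModelStack {
  pos = 0;        // end of the last token consumed, skipped or not
  reducePos = 0;  // end of the last non-skipped token, used to size reductions
  patched: boolean;

  constructor(patched: boolean) {
    this.patched = patched;
  }

  // Assumption: skipped tokens (comments, newlines) move pos only.
  skip(end: number): void {
    this.pos = end;
  }

  // Shift a non-skipped token covering [start, end) and return its start.
  shift(start: number, end: number): number {
    if (end > this.pos) {
      this.pos = end;
      this.reducePos = end;
    } else if (this.patched) {
      // Patched behaviour: a zero-length token still advances reducePos.
      this.reducePos = end;
    }
    return start;
  }

  // Size that a repeat reduction would compute for a node starting at `start`.
  reduceSize(start: number): number {
    return this.reducePos - start;
  }
}

// Skip 10 characters of comment/newline, then shift a zero-length token there.
function repeatSize(patched: boolean): number {
  const stack = new ModelStack(patched);
  stack.skip(10);
  const start = stack.shift(10, 10);
  return stack.reduceSize(start);
}

console.log(repeatSize(false)); // -10: reducePos was left behind at 0
console.log(repeatSize(true));  // 0: reducePos caught up with the token
```

In this model the unpatched path yields a negative size because `reducePos` still points before the skipped content, which is the condition that lets `storeNode` see `end < start`.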

Reproduction (minimal)

A contextual tokenizer emits a zero-length statementEnd token and a skip token for line comments.

Grammar sketch:

@top Document { statementEnd* Name }
@skip { whitespace | lineTerminator | Comment }
@external tokens { statementEnd @contextual }

Inputs:

  • # Comment\nname → OK
  • # Comment\n\nname → BUG (comment node positioned at 65536; TreeBuffer underflow)

With this change, both parse correctly.

Why this repro matters

statementEnd is used as a disambiguator in a language I’m working on that is not fully newline-sensitive. Line terminators are skipped, and a contextual tokenizer inserts a zero-length statementEnd only when a line break can terminate a statement and the next token does not indicate a line continuation (open delimiters, infix operators, certain keywords, etc.).
The grammar then allows statementEnd* / statementEnd+ between items to tolerate blank lines and comments.

That’s the same shape as the minimal repro: a zero-length statementEnd, skipped comments/newlines, and repeated separators.
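
The decision the contextual tokenizer makes can be sketched roughly as a pure function. This is only an illustration: `shouldEmitStatementEnd` and the contents of `CONTINUATION_TOKENS` are hypothetical, since the actual language's delimiters, operators and keywords are not listed here, and a real tokenizer would read from the parser input rather than take a string.

```typescript
// Hypothetical set of tokens that signal a line continuation.
// The real language's set is not given in this PR.
const CONTINUATION_TOKENS = new Set<string>(["(", "[", "{", "+", "-", "*", "/"]);

// Decide whether a zero-length statementEnd should be emitted at this point.
function shouldEmitStatementEnd(
  canEndStatement: boolean, // parser context allows a statement to end here
  sawLineBreak: boolean,    // at least one line terminator was skipped
  nextToken: string         // text of the next significant token
): boolean {
  if (!canEndStatement || !sawLineBreak) return false;
  // A continuation token means the statement carries on to the next line.
  return !CONTINUATION_TOKENS.has(nextToken);
}
```

Because the emitted token has zero length, any number of blank lines or comments can sit between two such tokens, which is why the grammar wraps them in a repeat.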

Tests

@lezer/lr doesn’t have tests; verified via the minimal repro above.

marijnh commented 2026-01-01 19:00:17 +01:00 (Migrated from github.com)

This seems like a good idea. Attached patch does this in a slightly simplified way.

marijnh commented 2026-01-06 14:24:38 +01:00 (Migrated from github.com)

This fix caused a regression (JavaScript statements with an inserted semicolon suddenly started covering the whitespace after the statement). Attached patch tweaks it. Could you verify that this doesn't reintroduce the issue you were having?


Pull request closed

Reference
lezer/lr!65