Allow a larger length in encodeArray #2
No description provided.
Complex grammars can apparently reach the current 2^16 limit of values during `buildParserFile`. The `max` setting is important for values inside the array, since they're tied to a certain typed array (e.g. `Uint16Array`), but the length of the array is independent of that, anyway. The corresponding part in `decode` is https://github.com/lezer-parser/lezer/blob/0.8.0/src/decode.ts#L22

Some of the data structures in these arrays use internal pointers, which won't work when the array is bigger than 2^16, so I don't think this change is safe. Off the top of my head, I'm sure the tokenizer state machine represents pointers to other states by their offset in the array; I'm not sure what else may be doing something like that.
Wouldn't the encoding of those individual pointers still fail if they're not < 2^16 (i.e. it would be as safe as before this change)?
If there's anything particular you'd want me to test, please just let me know.
FWIW, this change allows me to encode a certain (pretty complex) grammar that wasn't possible to generate before, and it seems to work just fine when using it (no obvious "pointer" issues). It's not really feasible to share that grammar (it depends on a custom tokenizer that depends on many other things), but I could try to write a standalone example to reproduce the size issue, if it helps.
Thanks!
Well, they have to be if they are offsets into an array that's less than 2^16 long.
Which of the arrays overflowed for you? At a glance, it looks like the state data doesn't need this restriction, since pointers into it are only stored in the 32-bit state array. The goto table does have this problem, but it might be reasonable to split it up into two arrays, with the index table being 32-bit, to work around that.
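The split described above could be sketched roughly like this (a hypothetical illustration; the names `SplitGotoTable` and `buildSplitGotoTable` and the layout are assumptions, not lezer's actual serialized format):

```typescript
// Hypothetical sketch: a 32-bit index array holds offsets into the
// entry data, so offsets may exceed 2^16 even though the individual
// entries still have to fit in a Uint16Array.
interface SplitGotoTable {
  index: Uint32Array // offset of each group's entries within `data`
  data: Uint16Array  // the goto entries themselves
}

function buildSplitGotoTable(groups: number[][]): SplitGotoTable {
  const index = new Uint32Array(groups.length)
  const flat: number[] = []
  for (let i = 0; i < groups.length; i++) {
    index[i] = flat.length // 32-bit, so it can point past 2^16
    for (const entry of groups[i]) {
      if (entry > 0xffff) throw new RangeError("entry overflows Uint16Array")
      flat.push(entry)
    }
  }
  return { index, data: Uint16Array.from(flat) }
}
```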
It's failing at this line:
IMO `encodeArray` could be kept quite general and doesn't really have to be concerned with the specifics of pointer arrays. If the caller is trying to store pointers that exceed the given `max` value, encoding those values will fail anyway. So I don't really see a reason to restrict the array length to the same limit as individual values, or to the hard-coded `0xffff`. But it's up to you, of course.

Oh, you are right about that: there's an additional check per content value, so the library won't silently encode overflowing stuff into the array. So your patch should indeed be fine (though I'm not sure how much leeway it gives you before one of the actual pointer values overflows).
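The safety argument above can be sketched like this (hypothetical code, not lezer's actual `encodeArray`; `encodeArraySketch` is an invented name for illustration):

```typescript
// Sketch only: the per-value check against `max` is what keeps encoding
// safe, and it is independent of the array's length. A Uint16Array can
// be far longer than 2^16 entries; only its element values are capped.
function encodeArraySketch(values: number[], max = 0xffff): Uint16Array {
  for (const v of values) {
    // Any value (e.g. an internal pointer) above `max` fails loudly
    // here, regardless of how long the array is.
    if (v > max) throw new RangeError(`value ${v} exceeds max ${max}`)
  }
  return Uint16Array.from(values)
}
```

With this shape, relaxing the length limit doesn't silence anything: an overflowing pointer still throws at the per-value check, just as it did before.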
Anyway, if you run into goto table overflows, please open an issue, since it's pretty clear how to work around those (though it would involve another breaking change to the serialized parser format). If the tokenizer gets too big, that'd require a more involved change in approach.
Perfect, thanks! I'll definitely keep an eye on it and will let you know.
BTW, thank you so much for this awesome library and your amazing work in general!