Allow a larger length in encodeArray #2

Merged
poeschko merged 1 commit from patch-1 into master 2020-02-21 12:18:34 +01:00
poeschko commented 2020-02-19 18:22:57 +01:00 (Migrated from github.com)

Complex grammars can apparently hit the current 2^16 length limit during `buildParserFile`. The `max` setting matters for the values inside the array, since they're tied to a certain typed array (e.g. `Uint16Array`), but the length of the array is independent of that. The corresponding part in `decode` is https://github.com/lezer-parser/lezer/blob/0.8.0/src/decode.ts#L22
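The decode-side logic being referred to can be sketched like this (a toy length-prefixed format invented purely for illustration, not lezer's actual encoding; `decodeArraySketch` is a hypothetical name): the element type (`Uint16Array`) fixes the range of each *value*, but the *length* read up front can be any size.

```typescript
// Hypothetical sketch, NOT lezer's real decode format: a decimal length
// prefix, then comma-separated base-36 values. Each value must fit in a
// Uint16Array slot, but the length itself is unbounded.
function decodeArraySketch(input: string): Uint16Array {
  const [head, body] = input.split(":")
  const out = new Uint16Array(Number(head))  // length independent of 0xffff
  body.split(",").forEach((v, i) => {
    if (v) out[i] = parseInt(v, 36)          // per-value 16-bit constraint
  })
  return out
}
```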
marijnh commented 2020-02-19 22:02:52 +01:00 (Migrated from github.com)

Some of the data structures in these arrays use internal pointers, which won't work when the array is longer than 2^16, so I don't think this change is safe. Off the top of my head, I'm sure the tokenizer state machine represents pointers to other states by their offset in the array, but not sure what else may be doing something like that.

poeschko commented 2020-02-20 18:53:23 +01:00 (Migrated from github.com)

Wouldn't the encoding of those individual pointers still fail if they're not < 2^16 (i.e. it would be as safe as before this change)?

If there's anything particular you'd want me to test, please just let me know.

FWIW, this change allows me to encode a certain (pretty complex) grammar that wasn't possible to generate before, and it seems to work just fine when using it (no obvious "pointer" issues). It's not really feasible to share that grammar (it depends on a custom tokenizer that depends on many other things), but I could try to write a standalone example to reproduce the size issue, if it helps.

Thanks!

marijnh commented 2020-02-21 08:18:09 +01:00 (Migrated from github.com)

> Wouldn't the encoding of those individual pointers still fail if they're not < 2^16

Well, they have to be if they are offsets into an array that's less than 2^16 long.

Which of the arrays did overflow for you? At a glance, it looks like the state data doesn't need this restriction, since pointers into that are only stored in the 32-bit state array. The goto table does have this problem, but it might be reasonable to split that up into two arrays, with the index table being 32-bit, to work around that.
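The split being described can be sketched as follows (a hypothetical `splitTable` helper under assumed inputs, not lezer's actual table layout): the per-row start offsets move into a separate 32-bit index array, so they are no longer forced to fit in 16 bits alongside the data.

```typescript
// Hypothetical sketch of the workaround: flatten per-state rows into one
// data array and keep each row's start offset in a 32-bit index array,
// allowing the data array to exceed 2^16 entries.
function splitTable(rows: number[][]): {index: Uint32Array; data: Uint16Array} {
  const index = new Uint32Array(rows.length)
  let total = 0
  for (let i = 0; i < rows.length; i++) {
    index[i] = total            // 32-bit pointer into `data`
    total += rows[i].length
  }
  const data = new Uint16Array(total)
  rows.forEach((row, i) => data.set(row, index[i]))
  return {index, data}
}
```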

poeschko commented 2020-02-21 12:10:47 +01:00 (Migrated from github.com)

It's failing at this line:

```
stateData: ${encodeArray(parser.data)},
```

IMO `encodeArray` could be kept quite general and doesn't really have to be concerned about the specifics of pointer arrays. If the caller is trying to store pointers that exceed the given `max` value, encoding those values will fail anyway. So I don't really see a reason to restrict the array length to the same limits as individual values or the hard-coded `0xffff`. But it's up to you, of course.
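The argument can be made concrete with a minimal sketch (again a toy format with a hypothetical `encodeArraySketch`, not the actual lezer encoder): with a range check on each value, an out-of-range pointer fails on its own, so the array length needs no `0xffff` cap.

```typescript
// Hypothetical sketch: each value is range-checked against `max`, so the
// array length prefix can be left unbounded without silently corrupting
// any pointer that happens to overflow.
function encodeArraySketch(values: readonly number[], max = 0xffff): string {
  let out = values.length + ":"   // length prefix, deliberately unbounded
  for (const value of values) {
    if (value < 0 || value >= max)
      throw new RangeError(`value ${value} out of range for max ${max}`)
    out += value.toString(36) + ","
  }
  return out
}
```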

marijnh commented 2020-02-21 12:17:52 +01:00 (Migrated from github.com)

Oh, you are right about that: there's an additional check per content value, so the library won't silently encode overflowing values into the array. So your patch should indeed be fine (though I'm not sure how much leeway it gives you before one of the actual pointer values overflows).

marijnh commented 2020-02-21 12:19:39 +01:00 (Migrated from github.com)

Anyway, if you run into goto table overflows, please open an issue, since it's pretty clear how to work around those (though it would involve another breaking change to the serialized parser format). If the tokenizer gets too big, that'd require a more involved change in approach.

poeschko commented 2020-02-21 12:23:12 +01:00 (Migrated from github.com)

Perfect, thanks! I'll definitely keep an eye on it and will let you know.

BTW, thank you so much for this awesome library and your amazing work in general!

Reference: lezer/generator!2