Improve search within NFKD normalized text #19
I work at Langfuse, where we use this package, and one of our customers reported a search issue involving Japanese text.
The problem seems to be that `SearchCursor` can miss matches when NFKD normalization expands a single source character into multiple characters. In particular, partial matches inside that expansion, or matches that cross from one expanded character into the next, were not being returned reliably.
I do not speak Japanese myself, so I added concrete test cases from the customer-provided examples rather than make broader assumptions about the text handling.
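To make the expansion concrete, here is a minimal sketch that is independent of CodeMirror. The helper `normalizeWithOffsets` is a hypothetical name, not part of `@codemirror/search`. It normalizes one code point at a time, which is a simplification: whole-string normalization can reorder combining marks across characters. The character "㌢" (U+3322) is used because it comes up later in this thread.

```typescript
// Hypothetical helper: NFKD-normalize a string one code point at a time and
// record, for every code unit of the result, the source offset it came from.
function normalizeWithOffsets(source: string): { text: string; origin: number[] } {
  let text = "";
  const origin: number[] = [];
  let pos = 0;
  for (const ch of source) {
    // `for...of` iterates by code point, so surrogate pairs stay together.
    const expanded = ch.normalize("NFKD");
    text += expanded;
    for (let i = 0; i < expanded.length; i++) origin.push(pos);
    pos += ch.length;
  }
  return { text, origin };
}

const { text, origin } = normalizeWithOffsets("5㌢");
console.log(text);               // "5センチ": one source character became three
console.log(origin);             // [0, 1, 1, 1]
console.log(text.indexOf("ン")); // 2, which lies inside the expansion of "㌢"
```

A search for "ン" therefore has to be reported against source offset 1, even though no character boundary in the document corresponds to the start or end of the match.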
This would be my first contribution to the CodeMirror codebase, so I would appreciate any direction if you think there is a more idiomatic way to handle this in `SearchCursor`.
Thanks for the patches. Unfortunately, doing it like that would have a potentially problematic effect on how replace works: if "㌢" is taken to match "ン", replacing "ン" with something else will completely consume the "㌢" character, which I'd consider data loss (you're removing content that wasn't actually matched). So it seems that for searching, this match is desired, but for replacing, it should be skipped.
To that purpose my patch (linked above) adds a `precise` field to search cursor matches, which is false when one of the sides doesn't correspond to an actual character boundary in the document. It sets up replace commands to look at that flag.
Could you say a bit more about what the code in your patch that looks at extending characters is trying to do?
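As a rough illustration of the idea described in this comment, and not the code from the linked patch, the sketch below reports every match in the normalized text but marks it as imprecise when either end falls inside the expansion of a single source character. The names `Match`, `findNormalized`, and `replacePrecise` are hypothetical, and the per-code-point normalization is a simplifying assumption.

```typescript
interface Match { from: number; to: number; precise: boolean }

// Hypothetical sketch: search NFKD-normalized text, map hits back to source offsets.
function findNormalized(source: string, query: string): Match[] {
  let text = "";
  const start: number[] = [];   // source offset where the originating character starts
  const end: number[] = [];     // source offset where the originating character ends
  const first: boolean[] = [];  // is this the first code unit of its expansion?
  const last: boolean[] = [];   // is this the last code unit of its expansion?
  let pos = 0;
  for (const ch of source) {
    const expanded = ch.normalize("NFKD");
    for (let i = 0; i < expanded.length; i++) {
      start.push(pos);
      end.push(pos + ch.length);
      first.push(i === 0);
      last.push(i === expanded.length - 1);
    }
    text += expanded;
    pos += ch.length;
  }
  const needle = query.normalize("NFKD");
  const matches: Match[] = [];
  if (needle.length === 0) return matches;
  for (let i = text.indexOf(needle); i >= 0; i = text.indexOf(needle, i + 1)) {
    const j = i + needle.length - 1;
    // Precise only if both ends line up with real character boundaries.
    matches.push({ from: start[i], to: end[j], precise: first[i] && last[j] });
  }
  return matches;
}

// Replace only precise matches, so unmatched content is never consumed.
function replacePrecise(source: string, query: string, replacement: string): string {
  let out = "";
  let pos = 0;
  for (const m of findNormalized(source, query)) {
    if (!m.precise || m.from < pos) continue;
    out += source.slice(pos, m.from) + replacement;
    pos = m.to;
  }
  return out + source.slice(pos);
}

console.log(findNormalized("5㌢", "ン"));          // [{ from: 1, to: 2, precise: false }]
console.log(replacePrecise("5㌢ ン", "ン", "X"));  // "5㌢ X"
```

In this sketch, search still finds "ン" inside "㌢", while replace leaves "㌢" intact and rewrites only the standalone "ン".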
Ah, I had my blinders on and was fully focused on the search use case, so I didn’t consider how my changes would behave during replacement.
I took a quick look at your patch, and it seems to address all the requirements I had in mind. The part I added around extending characters was just an alternative approach to handling character traversal.
Thanks a lot for the quick turnaround and the changes, I really appreciate it!
Great. I've tagged this in @codemirror/search 6.7.0
Pull request closed