Improve search within NFKD normalized text #19

Open
bezbac wants to merge 2 commits from bezbac/search:main into main
First-time contributor

I work at Langfuse, where we use this package, and one of our customers reported a search issue involving Japanese text.

The problem seems to be that SearchCursor can miss matches when NFKD normalization expands a single source character into multiple characters. In particular, partial matches inside that expansion, or matches that cross from one expanded character into the next, were not being returned reliably.

I do not speak Japanese myself, so I added concrete test cases from the customer provided examples rather than make broader assumptions about the text handling.

This would be my first contribution to the CodeMirror codebase, so I would appreciate any direction if you think there is a more idiomatic way to handle this in SearchCursor.

I work at [Langfuse](https://langfuse.com), where we use this package, and one of our customers reported a search issue involving Japanese text. The problem seems to be that `SearchCursor` can miss matches when NFKD normalization expands a single source character into multiple characters. In particular, partial matches inside that expansion, or matches that cross from one expanded character into the next, were not being returned reliably. I do not speak Japanese myself, so I added concrete test cases from the customer provided examples rather than make broader assumptions about the text handling. This would be my first contribution to the CodeMirror codebase, so I would appreciate any direction if you think there is a more idiomatic way to handle this in `SearchCursor`.
Owner

Thanks for the patches. Unfortunately, doing it like would have a potentially problematic effect on how replace works—if "㌢" is taken to match "ン", replacing "ン" with something else will completely consume the "㌢" character, which I'd consider data loss (you're removing content that wasn't actually matched). So it seems that for searching, this match is desired, but for replacing it, should be skipped.

To that purpose my patch (linked above) adds a precise field to search cursor matches, which is false when one of the sides doesn't correspond to an actual character boundary in the document. It sets up replace commands to look at that flag.

Could you say a bit more about what the code in your patch that looks at extending characters is trying to do?

Thanks for the patches. Unfortunately, doing it like would have a potentially problematic effect on how replace works—if "㌢" is taken to match "ン", replacing "ン" with something else will completely consume the "㌢" character, which I'd consider data loss (you're removing content that wasn't actually matched). So it seems that for searching, this match is desired, but for replacing it, should be skipped. To that purpose my patch (linked above) adds a `precise` field to search cursor matches, which is false when one of the sides doesn't correspond to an actual character boundary in the document. It sets up replace commands to look at that flag. Could you say a bit more about what the code in your patch that looks at extending characters is trying to do?
This pull request has changes conflicting with the target branch.
  • src/cursor.ts
  • test/test-cursor.ts
View command line instructions

Checkout

From your project repository, check out a new branch and test the changes.
git fetch -u main:bezbac-main
git switch bezbac-main

Merge

Merge the changes and update on Forgejo.

Warning: The "Autodetect manual merge" setting is not enabled for this repository, you will have to mark this pull request as manually merged afterwards.

git switch main
git merge --no-ff bezbac-main
git switch bezbac-main
git rebase main
git switch main
git merge --ff-only bezbac-main
git switch bezbac-main
git rebase main
git switch main
git merge --no-ff bezbac-main
git switch main
git merge --squash bezbac-main
git switch main
git merge --ff-only bezbac-main
git switch main
git merge bezbac-main
git push origin main
Sign in to join this conversation.
No reviewers
No labels
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
codemirror/search!19
No description provided.