Improve search within NFKD normalized text

bezbac commented

2026-04-15 13:08:36 +02:00

First-time contributor

I work at [Langfuse](https://langfuse.com), where we use this package, and one of our customers reported a search issue involving Japanese text.

The problem seems to be that `SearchCursor` can miss matches when NFKD normalization expands a single source character into multiple characters. In particular, partial matches inside that expansion, or matches that cross from one expanded character into the next, were not being returned reliably.

I do not speak Japanese myself, so I added concrete test cases from the customer-provided examples rather than making broader assumptions about the text handling.

This would be my first contribution to the CodeMirror codebase, so I would appreciate any direction if you think there is a more idiomatic way to handle this in `SearchCursor`.
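To make the failure mode concrete, here is a minimal standalone sketch of searching in NFKD-normalized text while mapping matches back to source offsets. It is not the `SearchCursor` implementation: `findNormalized` and the `NormalizedMatch` shape are hypothetical names, and normalizing one code point at a time is a simplification that ignores reordering of combining marks across characters.

```typescript
// Hypothetical sketch, not part of @codemirror/search.
interface NormalizedMatch {
  from: number;     // source offset where the match starts
  to: number;       // source offset where the match ends
  precise: boolean; // true when both ends fall on source character boundaries
}

function findNormalized(source: string, query: string): NormalizedMatch[] {
  const q = query.normalize("NFKD");
  let norm = "";
  // One entry per UTF-16 unit of the normalized text.
  const starts: number[] = [];  // source offset of the originating character
  const ends: number[] = [];    // source offset just past that character
  const first: boolean[] = [];  // unit is the first of its expansion
  const last: boolean[] = [];   // unit is the last of its expansion

  let pos = 0;
  while (pos < source.length) {
    // Take one full code point (handles surrogate pairs).
    const ch = String.fromCodePoint(source.codePointAt(pos)!);
    const expanded = ch.normalize("NFKD");
    for (let i = 0; i < expanded.length; i++) {
      starts.push(pos);
      ends.push(pos + ch.length);
      first.push(i === 0);
      last.push(i === expanded.length - 1);
    }
    norm += expanded;
    pos += ch.length;
  }

  const out: NormalizedMatch[] = [];
  if (q.length === 0) return out;
  for (let i = norm.indexOf(q); i !== -1; i = norm.indexOf(q, i + 1)) {
    const j = i + q.length - 1;
    // Widen the match to cover every source character it touches.
    out.push({ from: starts[i], to: ends[j], precise: first[i] && last[j] });
  }
  return out;
}
```

With this sketch, `findNormalized("5㌢", "ン")` reports a match covering the whole "㌢" character (offsets 1 to 2), because "㌢" expands to "センチ" under NFKD and "ン" sits in the middle of that expansion. A search that only accepts matches aligned to source character boundaries would miss it.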

bezbac added 2 commits

2026-04-15 13:08:36 +02:00

Add test cases for partial NFKD matches bf2dabf416

Fix partial NFKD matching behavior dda37b5dfd

marijn referenced this pull request from a commit

2026-04-17 10:33:18 +02:00

Report preciseness of search cursor matches

marijn commented

2026-04-17 10:37:09 +02:00

Owner

Thanks for the patches. Unfortunately, doing it like this would have a potentially problematic effect on how replace works: if "㌢" is taken to match "ン", replacing "ン" with something else will completely consume the "㌢" character, which I'd consider data loss (you're removing content that wasn't actually matched). So it seems that for searching, this match is desired, but for replacing, it should be skipped.

To that purpose my patch (linked above) adds a `precise` field to search cursor matches, which is false when one of the sides doesn't correspond to an actual character boundary in the document. It sets up replace commands to look at that flag.

Could you say a bit more about what the code in your patch that looks at extending characters is trying to do?
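The replace-side behavior described here can be sketched roughly as follows. This is an illustration only, not the code from the linked patch: `CursorMatch` and `replacePreciseMatches` are hypothetical names, the match shape is assumed from the description of the `precise` field, and matches are assumed to be sorted by `from`.

```typescript
// Hypothetical sketch of a replace step that honors a `precise` flag.
interface CursorMatch {
  from: number;
  to: number;
  precise: boolean; // false when a side is not on a real character boundary
}

function replacePreciseMatches(
  doc: string,
  matches: CursorMatch[], // assumed sorted by `from`
  replacement: string
): string {
  let result = "";
  let last = 0;
  for (const m of matches) {
    // Skip imprecise matches so unmatched content is never consumed,
    // and skip matches that overlap an already-replaced range.
    if (!m.precise || m.from < last) continue;
    result += doc.slice(last, m.from) + replacement;
    last = m.to;
  }
  return result + doc.slice(last);
}
```

For a document "5㌢とン" with an imprecise match on "㌢" and a precise match on the final "ン", replacing with "X" would leave "㌢" intact and yield "5㌢とX". Search can still surface both matches.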

bezbac commented

2026-04-21 11:35:46 +02:00

Author

First-time contributor

Ah, I had my blinders on and was fully focused on the search use case, so I didn’t consider how my changes would behave during replacement.

I took a quick look at your patch, and it seems to address all the requirements I had in mind. The part I added around extending characters was just an alternative approach to handling character traversal.

Thanks a lot for the quick turnaround and the changes, I really appreciate it!


bezbac closed this pull request

2026-04-21 11:35:47 +02:00

marijn commented

2026-04-21 12:50:32 +02:00

Owner

Great. I've tagged this in @codemirror/search 6.7.0

Pull request closed



Improve search within NFKD normalized text #19
