WIP: Add "whole word" option to search/query config #14

Closed
hubgit wants to merge 4 commits from search-whole-word into main
hubgit commented 2022-08-03 15:13:37 +02:00 (Migrated from github.com)

It would be useful to have a "whole word" option for searches. This is particularly useful when replacing, to ensure that parts of words aren't unintentionally replaced.

  • Refactor insideWord and insideWordBoundaries to a separate module, wrapped in isWholeWord.
  • Add wholeWord boolean option to the search extension and the SearchQuery class.
  • Add isValidMatch function, used to filter matches (currently only using the wholeWord option, but could potentially also be used for a "search within selection" option).
    • Pass state around instead of doc, so the state can be used inside the isValidMatch function.

This approach seems to work, but it does require creating state.charCategorizer for every match. It's an opt-in behaviour, so maybe that's acceptable, but I'm completely open to other ways of doing this.

It would be useful to have a "whole word" option for searches. This is particularly useful when replacing, to ensure that parts of words aren't unintentionally replaced. * Refactor `insideWord` and `insideWordBoundaries` to a separate module, wrapped in `isWholeWord`. * Add `wholeWord` boolean option to the `search` extension and the `SearchQuery` class. * Add `isValidMatch` function, used to filter matches (currently only using the `wholeWord` option, but could potentially also be used for a "search within selection" option). * Pass `state` around instead of `doc`, so the state can be used inside the `isValidMatch` function. This approach seems to work, but it does require creating `state.charCategorizer` for every match. It's an opt-in behaviour, so maybe that's acceptable, but I'm completely open to other ways of doing this.
marijnh commented 2022-08-23 15:03:37 +02:00 (Migrated from github.com)

I've gone with an alternative implementation in 790059135d . Does that work for you?

I've gone with an alternative implementation in 790059135db92c858d923a7feb91defb2b7f5564 . Does that work for you?
hubgit commented 2022-08-24 13:14:32 +02:00 (Migrated from github.com)

That seems to work very nicely 👍

I think we might want to change the definition of "whole word" to be "anything that doesn't have a word character directly outside it, on either side", but that will need a bit of verification, so it's up to you if you think that logic makes sense and would prefer to implement it now.

That seems to work very nicely 👍 I _think_ we might want to change the definition of "whole word" to be "anything that doesn't have a word character directly outside it, on either side", but that will need a bit of verification, so it's up to you if you think that logic makes sense and would prefer to implement it now.
aeaton-overleaf commented 2022-09-01 13:57:16 +02:00 (Migrated from github.com)

On closer inspection/testing, there seems to be a bug in the stringWordTest and regexpWordTest implementations:

github.com/codemirror/search@e2d6ffbf1f/src/search.ts (L168-L179)

github.com/codemirror/search@e2d6ffbf1f/src/search.ts (L246-L253)

They should be saying "return true if the characters directly inside are word characters, and the characters directly outside are non-word characters", i.e. this:

return categorizer(charAfter(buf, from - bufPos)) == CharCategory.Word &&
            categorizer(charBefore(buf, to - bufPos)) == CharCategory.Word &&
            categorizer(charBefore(buf, from - bufPos)) != CharCategory.Word &&
            categorizer(charAfter(buf, to - bufPos)) != CharCategory.Word;
On closer inspection/testing, there seems to be a bug in the `stringWordTest` and `regexpWordTest` implementations: https://github.com/codemirror/search/blob/e2d6ffbf1f9d14085c2504ac9312d7ac2d7b4f69/src/search.ts#L168-L179 https://github.com/codemirror/search/blob/e2d6ffbf1f9d14085c2504ac9312d7ac2d7b4f69/src/search.ts#L246-L253 They should be saying "return `true` if the characters directly inside are word characters, and the characters directly outside are non-word characters", i.e. this: ```js return categorizer(charAfter(buf, from - bufPos)) == CharCategory.Word && categorizer(charBefore(buf, to - bufPos)) == CharCategory.Word && categorizer(charBefore(buf, from - bufPos)) != CharCategory.Word && categorizer(charAfter(buf, to - bufPos)) != CharCategory.Word; ```
aeaton-overleaf commented 2022-09-01 14:03:04 +02:00 (Migrated from github.com)

Alternatively, for the more straightforward "anything that doesn't have a word character directly outside it, on either side" definition of a whole word, we could use this:

return categorizer(charBefore(buf, from - bufPos)) != CharCategory.Word &&
            categorizer(charAfter(buf, to - bufPos)) != CharCategory.Word;
Alternatively, for the more straightforward "anything that doesn't have a word character directly outside it, on either side" definition of a whole word, we could use this: ```js return categorizer(charBefore(buf, from - bufPos)) != CharCategory.Word && categorizer(charAfter(buf, to - bufPos)) != CharCategory.Word; ```
marijnh commented 2022-09-01 18:07:30 +02:00 (Migrated from github.com)

They should be saying "return true if the characters directly inside are word characters, and the characters directly outside are non-word characters",

That depends on how you specify this feature. I followed VS Code, which seems to disable the test for matches that don't themselves end/start in word characters.

> They should be saying "return true if the characters directly inside are word characters, and the characters directly outside are non-word characters", That depends on how you specify this feature. I followed VS Code, which seems to disable the test for matches that don't themselves end/start in word characters.
aeaton-overleaf commented 2022-09-02 16:44:56 +02:00 (Migrated from github.com)

I followed VS Code, which seems to disable the test for matches that don't themselves end/start in word characters.

Matching the behaviour of VS Code sounds good in theory (it's what I was aiming for as well), but there seems to be something not quite right in the logic, as it currently matches anything which starts or ends with a non-word character, even if the other end is in the middle of a word.

See demo in https://nz2134.csb.app/ or try once the latest version of @codemirror/search has been deployed there.

> I followed VS Code, which seems to disable the test for matches that don't themselves end/start in word characters. Matching the behaviour of VS Code sounds good in theory (it's what I was aiming for as well), but there seems to be something not quite right in the logic, as it currently matches _anything_ which starts or ends with a non-word character, even if the other end is in the middle of a word. See demo in https://nz2134.csb.app/ or [try](https://codemirror.net/try/?c=aW1wb3J0IHsgRWRpdG9yVmlldywga2V5bWFwIH0gZnJvbSAiQGNvZGVtaXJyb3IvdmlldyI7CmltcG9ydCB7CiAgb3BlblNlYXJjaFBhbmVsLAogIHNlYXJjaCwKICBzZWFyY2hLZXltYXAsCiAgc2V0U2VhcmNoUXVlcnksCiAgU2VhcmNoUXVlcnkKfSBmcm9tICJAY29kZW1pcnJvci9zZWFyY2giOwoKY29uc3QgdmlldyA9IG5ldyBFZGl0b3JWaWV3KHsKICBkb2M6ICIlIFlvdXIgaW50cm9kdWN0aW9uIGdvZXMgaGVyZSFcbiIucmVwZWF0KDEwKSwKICBwYXJlbnQ6IGRvY3VtZW50LmJvZHksCiAgZXh0ZW5zaW9uczogW3NlYXJjaCgpLCBrZXltYXAub2Yoc2VhcmNoS2V5bWFwKV0KfSk7CgpvcGVuU2VhcmNoUGFuZWwodmlldyk7Cgp2aWV3LmRpc3BhdGNoKHsKICBlZmZlY3RzOiBzZXRTZWFyY2hRdWVyeS5vZigKICAgIG5ldyBTZWFyY2hRdWVyeSh7CiAgICAgIHdob2xlV29yZDogdHJ1ZSwKICAgICAgc2VhcmNoOiAiJSBZb3VyIGludHJvZHVjIgogICAgfSkKICApCn0pOwo=) once the latest version of `@codemirror/search` has been deployed there.
marijnh commented 2022-09-02 17:13:09 +02:00 (Migrated from github.com)

as it currently matches anything which starts or ends with a non-word character, even if the other end is in the middle of a word.

That is what I was going for. Are you proposing to only disable the check for sides of the match that don't start/end with a word character?

> as it currently matches anything which starts or ends with a non-word character, even if the other end is in the middle of a word. That is what I was going for. Are you proposing to only disable the check for sides of the match that don't start/end with a word character?
aeaton-overleaf commented 2022-09-02 17:22:13 +02:00 (Migrated from github.com)

I think the behaviour of VS Code is almost what's in https://github.com/codemirror/search/pull/14#issuecomment-1234183763, but it matches anything that doesn't have a word character immediately inside and outside an edge of the match.

I think the behaviour of VS Code is almost what's in https://github.com/codemirror/search/pull/14#issuecomment-1234183763, but it matches anything that doesn't have a word character immediately inside and outside an edge of the match.
marijnh commented 2022-09-02 17:40:22 +02:00 (Migrated from github.com)

How does attached patch look?

How does attached patch look?
aeaton-overleaf commented 2022-09-02 17:40:28 +02:00 (Migrated from github.com)

After a bit more investigation, I think the simplest description of "whole word" might be "does not have a word character on both sides of a boundary"?

After a bit more investigation, I think the simplest description of "whole word" might be "does not have a word character on both sides of a boundary"?
aeaton-overleaf commented 2022-09-02 17:42:27 +02:00 (Migrated from github.com)

How does attached patch look?

I'm interpreting that as "has a non-word character on at least one side of both boundaries", which I think is correct.

> How does attached patch look? I'm interpreting that as "has a non-word character on at least one side of both boundaries", which I think is correct.
marijnh commented 2022-09-02 18:09:20 +02:00 (Migrated from github.com)

Yes, that's what the new grouping of the expression was supposed to make clear -- it only filters them out if either side has both a word character before and a word character after it.

Yes, that's what the new grouping of the expression was supposed to make clear -- it only filters them out if either side has both a word character before and a word character after it.
aeaton-overleaf commented 2022-09-26 11:00:03 +02:00 (Migrated from github.com)

@marijnh This is looking good now, if you wouldn't mind putting out a new release 🙏🏼

@marijnh This is looking good now, if you wouldn't mind putting out a new release 🙏🏼
marijnh commented 2022-09-26 18:27:12 +02:00 (Migrated from github.com)

I've tagged 6.2.1

I've tagged 6.2.1

Pull request closed

Sign in to join this conversation.
No reviewers
No labels
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
codemirror/search!14
No description provided.