Support tracking changes in marks and attributes #21

Closed
arnaugomez wants to merge 8 commits from feature/support-changes-in-marks into master
arnaugomez commented 2025-05-02 17:48:53 +02:00 (Migrated from github.com)

This PR adds an option to configure the `ChangeSet` class so that it can find changes in marks and attributes, as well as in nodes.

Approach

Adds a `tokenEncoder` configuration option to the `ChangeSet` class. It accepts a `TokenEncoder` object, which lets developers customize how characters and nodes are encoded for diffing. This lets the developer find changes not only in nodes and content, but also in marks and attributes of their choice.

Adds 3 built-in `TokenEncoder` objects:

  • `BaseEncoder` (default): encodes only the character and node type info inside tokens. It is equivalent to using the `ChangeSet` class before this PR.
  • `MarkEncoder`: encodes the mark data inside tokens.
  • `AttributeEncoder`: encodes the data of marks and attributes inside tokens.
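To make the idea concrete, here is a minimal sketch of a mark-aware encoder. The `TokenEncoder` interface, its method names, and `markAwareEncoder` are illustrative assumptions for this sketch, not the exact API from this PR:

```typescript
// Hypothetical sketch: the interface and method names below are
// assumptions for illustration, not the exact API from this PR.
type Token = string | number

interface TokenEncoder {
  // Encode a text character together with the names of the marks applied to it.
  encodeCharacter(char: number, markNames: string[]): Token
  // Encode a non-text node by its type name.
  encodeNode(typeName: string): Token
}

// An encoder that distinguishes the same character with different marks,
// so that e.g. toggling bold on a word shows up as a change in the diff.
const markAwareEncoder: TokenEncoder = {
  encodeCharacter(char, markNames) {
    // Sort so that mark order does not affect the resulting token.
    return char + ":" + markNames.slice().sort().join(",")
  },
  encodeNode(typeName) {
    return "node:" + typeName
  },
}

const plain = markAwareEncoder.encodeCharacter("a".charCodeAt(0), [])
const bold = markAwareEncoder.encodeCharacter("a".charCodeAt(0), ["strong"])
console.log(plain === bold) // → false: same character, different marks
```

With the default `BaseEncoder`, both calls would yield the same token and the mark change would be invisible to the diff.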

Breaking changes

There are no breaking changes. The default behavior of the `ChangeSet` class is unchanged; this PR only adds the `tokenEncoder` configuration option.

Changes

  • Added the `tokenEncoder` configuration option to the `ChangeSet` class.
  • Modified the `tokens` function so that it supports custom token encoders.
  • Updated the docs.

Motivation for this change

At [Tiptap](https://tiptap.dev/), we're building tools to help track the changes made by multiple users in the same document. Some changes might only modify a mark in the document (e.g. changing a word from normal to bold). We want to be able to detect these changes with the prosemirror-changeset library.

We're also building workflows where [an AI assistant edits the document](https://tiptap.dev/docs/content-ai/capabilities/changes/overview), and the user reviews the changes it made. We want to be able to detect changes in formatting made by the AI, and show them to the user in a diff format.


Please let us know what you think of the approach we took to solving the problem, and whether you'd recommend solving it another way. Also let us know if there are any issues in formatting/tests/docs that we should fix. Thanks for reviewing 😄.

Closes #4

marijnh commented 2025-05-02 18:04:35 +02:00 (Migrated from github.com)

This seems like a good idea, but are you certain we need this encoder abstraction? Can you think of any modes beyond the three you're providing here that could be useful?

Also, I'm a bit worried about the performance of the string concatenation and JSON encoding — these will be run a lot.

arnaugomez commented 2025-05-02 18:25:28 +02:00 (Migrated from github.com)

> Can you think of any modes beyond the three you're providing here that could be useful?

I believe some developers might want to compare some attributes and ignore others; that's why I added the option to define your own encoder.

> Also, I'm a bit worried about the performance of the string concatenation and JSON encoding — these will be run a lot.

Yes, I think you've got a point, especially in the attribute encoder. The `JSON.stringify` method would run approximately once for each letter and node in the text. That can be a lot, especially if you want the changes to be re-computed on every transaction.

An alternative solution could be: instead of storing the tokens as a string/number, store them as a `Token` object that can have attributes and metadata. Then, in the `TokenEncoder`, define an equality function that determines whether two tokens are equal. This solution would not involve any `JSON.stringify` or string concatenation. What do you think of it?
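A minimal sketch of that alternative, with hypothetical names (the `Token` shape and `compareTokens` function below are assumptions for illustration, not code from this repository):

```typescript
// Hypothetical sketch of the proposed alternative: tokens are plain
// objects, and the encoder supplies an equality function, so no
// JSON.stringify or string concatenation runs per character.
interface Token {
  char: number
  markNames: string[] // assumed to be kept in a consistent order
}

// Equality check the diffing algorithm would use instead of comparing
// encoded strings; runs in O(number of marks) with no allocations.
function compareTokens(a: Token, b: Token): boolean {
  if (a.char !== b.char || a.markNames.length !== b.markNames.length) return false
  for (let i = 0; i < a.markNames.length; i++)
    if (a.markNames[i] !== b.markNames[i]) return false
  return true
}

const boldA: Token = { char: 97, markNames: ["strong"] }
const boldB: Token = { char: 97, markNames: ["strong"] }
const plainA: Token = { char: 97, markNames: [] }
console.log(compareTokens(boldA, boldB)) // → true
console.log(compareTokens(boldA, plainA)) // → false
```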

marijnh commented 2025-05-05 09:08:51 +02:00 (Migrated from github.com)

Does the attached patch, which includes a compare function in the encoder abstraction (and intentionally doesn't provide any alternative custom encoder implementations), look like it would work for you?

arnaugomez commented 2025-05-05 09:53:01 +02:00 (Migrated from github.com)

Hi Marijn. Thank you very much for your commit. Yes, I think it would work for us.

I have left a review comment in case you want to consider it.

https://github.com/ProseMirror/prosemirror-changeset/commit/562c61c674010f88f58be13370551b1f424a8261#r156383467

arnaugomez commented 2025-05-05 10:29:23 +02:00 (Migrated from github.com)

Since you already made a commit with the changes from this PR, I'm closing it.

@marijnh feel free to close issue #21 too, if you think it's resolved.


Pull request closed
