Empty paragraphs are pasted in with an extra <br> when copied from Google Docs #1511

Open
opened 2025-02-18 14:07:10 +01:00 by ChiriVulpes · 17 comments
ChiriVulpes commented 2025-02-18 14:07:10 +01:00 (Migrated from github.com)

I don't actually know if this is specific to Google Docs's copy format or if it would always occur for empty paragraphs like this.

To reproduce:

  1. Open this document https://docs.google.com/document/d/1eALoiE4ufLdYqKaGYfHrT3cnVHpYW-o1rUzHs2Q1yFY/edit
  2. Ctrl + A, Ctrl + C
  3. Paste into the example editor https://prosemirror.net/
  4. The resulting text is paragraph 1, followed by a paragraph of two blank lines, followed by paragraph 2
marijnh commented 2025-02-18 14:39:04 +01:00 (Migrated from github.com)

The HTML we get in this case looks, if I remove the attributes, like this:

<p><span>Paragraph 1</span></p><br><p><span>Paragraph 2</span></p>

ProseMirror assumes the stray <br> actually stands for a break element, so it parses it to a hard_break node. That node must occur in an inline context, so it gets a parent paragraph. There is a hack in ProseMirror's parser that will drop <br> nodes at the end of their parent block, since those are typically used as placeholders in empty textblocks. But the way it completely replaces the entire paragraph with a loose break node in this situation isn't a common style, and is hard to distinguish from situations where the break node is intended to be part of the document.

ChiriVulpes commented 2025-02-18 14:46:22 +01:00 (Migrated from github.com)

Hmm. Google at it again... This is going to come up a lot for the community using my application, since lots of authors copy-paste from GDocs. But maybe if that's not common usage for ProseMirror overall, it should be fixed on my end? That would be transformPastedHTML, correct?

Edit: I went ahead and just fixed it on my end. Feel free to do what you like with this issue.
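
For illustration only, here is a minimal sketch of what a `transformPastedHTML`-style workaround could look like. It is not the fix used above. The function name `wrapLooseBreaks` and the list of block tags are assumptions, and the string-based matching only handles a single `<br>` sitting directly between two block elements. As discussed in this thread, it cannot tell whether such a break was intended as content.

```typescript
// Block-level tag names this sketch recognizes (an assumption; extend as needed).
const BLOCK = "p|h[1-6]|ul|ol|blockquote|div|pre";

// Matches a <br> that sits directly between a closing block tag and either
// the next opening block tag or the end of the input.
const LOOSE_BREAK = new RegExp(
  `(</(?:${BLOCK})>)\\s*<br\\b[^>]*>\\s*(?=<(?:${BLOCK})\\b|$)`,
  "gi"
);

// Hypothetical helper: turn each loose <br> into an empty paragraph, so the
// parser sees <p></p> instead of a stray break outside any textblock.
function wrapLooseBreaks(html: string): string {
  return html.replace(LOOSE_BREAK, "$1<p></p>");
}
```

Since it takes the pasted HTML string and returns a string, a function like this could presumably be passed as the `transformPastedHTML` prop. A `<br>` inside a paragraph, such as `<p>a<br>b</p>`, is left untouched.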

marijnh commented 2025-02-18 17:21:44 +01:00 (Migrated from github.com)

I'm going to leave this open and see if more people are running into it. It's quite possible that this is a recent change in Google Docs—I'm pretty sure that last time I looked, empty paragraphs had a <p> tag around the <br> in their clipboard format.

bZichett commented 2025-02-18 17:26:52 +01:00 (Migrated from github.com)

I don't think it is recent, but I am not positive. I would appreciate a special case here, as I sporadically paste a lot from Google Docs.

moetelo commented 2025-02-20 12:23:58 +01:00 (Migrated from github.com)

Right now, we're using a somewhat messy transformPastedHTML in production to handle pasting from Google Docs, MS Word and OpenOffice. However, it doesn't feel reliable enough as users report issues every few months. Plus, I'm not sure if such logic should be the responsibility of an application using ProseMirror.

It would be great if ProseMirror natively preserved the formatting of content pasted from Google Docs and other Word-like editors.

marijnh commented 2025-02-20 14:17:46 +01:00 (Migrated from github.com)

> It would be great if ProseMirror natively preserved the formatting of content pasted from Google Docs and other Word-like editors.

I'm not sure how you expect that to work. Firstly, ProseMirror is schema-agnostic, so it doesn't magically know how the nodes you define would map to whatever equivalent constructs exist in the various word processing systems. Secondly, as you found, these spit out all kinds of completely ludicrous HTML, and in some situations it's not even clear how to extract the semantic meaning from that.

zunsthy commented 2025-03-14 14:07:53 +01:00 (Migrated from github.com)

What the hell!

A <br> outside of a paragraph loses information. If I set line-height, margin-top, and margin-bottom on an empty paragraph, all of them disappear from the pasted HTML. Google Docs saves paragraph info to the first paragraph's id attribute. This seems to prevent other editors from recovering the same content from the HTML.

ChiriVulpes commented 2025-06-01 17:21:36 +02:00 (Migrated from github.com)

Google seems to always use just a <br> in place of <p><br></p>. I don't actually know how to fix this consistently, because it's not just between paragraphs. If I put a heading, then an empty line, then a paragraph, for example, when copy-pasting into ProseMirror that single <br> between the heading and the paragraph becomes <p><br><br></p>. I don't think I can reliably use transformPastedHTML to fix this, and I'm not sure what other workarounds I could even employ. Any ideas?

dominikklein commented 2025-06-12 10:06:03 +02:00 (Migrated from github.com)

When I have for example the following clipboard content:

<p>test</p>
<p><br></p>
<p>test 3</p>

And paste this inside the editor, this leads to multiple line breaks instead of one:

<p>test</p>
<p><br><br class="ProseMirror-trailingBreak"></p>
<p>test 3</p>

Is this expected?

(It also happens when the initial content has such markup.)

marijnh commented 2025-06-12 10:31:05 +02:00 (Migrated from github.com)

> Is this expected?

Yes. The first <br> was found in the clipboard content, so the editor assumes it refers to an actual hard break that should be part of the content. The second is just there to prevent the browser from collapsing the first one, and is not part of the actual document content.

dominikklein commented 2025-06-12 14:11:10 +02:00 (Migrated from github.com)

Thanks for the explanation.

I can't completely follow, but maybe that's due to a lack of understanding on my side.

Why should the browser prevent something? At least in our case, we get two "visible" line breaks because of that (we are using TipTap).

marijnh commented 2025-06-12 14:15:14 +02:00 (Migrated from github.com)

In an editing context, you want the line after a trailing <br> to be visible, so that the user can put the cursor after the break. A <p><br></p> would just display a single line (before the break), so ProseMirror adds a dummy break at the end to make the second line show up, similar to how it adds a dummy to entirely empty textblocks to make them show up at all.
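
As a rough illustration of the rule described here, the following toy model decides when a dummy break is appended. It is a sketch, not ProseMirror's actual code: the `InlineToken` type and both function names are made up for this example.

```typescript
// Toy model, not ProseMirror's implementation: inline content is a list of
// hypothetical tokens, where "br" stands for a real hard break.
type InlineToken = "text" | "br";

// A textblock needs a dummy trailing break when the browser would otherwise
// not show its last line: it is either empty or ends in a real break.
function needsTrailingBreak(content: InlineToken[]): boolean {
  return content.length === 0 || content[content.length - 1] === "br";
}

// Render a paragraph roughly the way the editing view would display it.
function renderParagraph(content: InlineToken[]): string {
  const inner = content.map(t => (t === "br" ? "<br>" : "text")).join("");
  const dummy = needsTrailingBreak(content)
    ? '<br class="ProseMirror-trailingBreak">'
    : "";
  return `<p>${inner}${dummy}</p>`;
}
```

In this model, a paragraph holding a single real break renders as `<p><br><br class="ProseMirror-trailingBreak"></p>`, which is the markup reported earlier in the thread, while a paragraph ending in text gets no dummy at all.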

dominikklein commented 2025-06-12 15:56:57 +02:00 (Migrated from github.com)

So in the end, every paragraph without real text gets this <br class="ProseMirror-trailingBreak"> part? The situation where only a <br> exists is really strange, because it leads to two visible new lines, while the source contains only one. Sure, that is what ends up in the document content later, but it's not what the user sees at that moment.

ChiriVulpes commented 2025-06-12 16:00:27 +02:00 (Migrated from github.com)

Shouldn't the assumption be that places we copy out of are using <br>'s in exactly the way that you're using them? To make empty blocks actually show?

marijnh commented 2025-06-12 16:56:17 +02:00 (Migrated from github.com)

> Shouldn't the assumption be that places we copy out of are using <br>'s in exactly the way that you're using them?

That's only appropriate if the content is copied from a (poorly implemented) web editor. If it's regular HTML, or an editor like ProseMirror which is considerate enough to strip off such internal details when you copy, the assumption that such <br> nodes should be dropped doesn't hold.

mweidner037 commented 2025-06-12 20:28:42 +02:00 (Migrated from github.com)

I've been investigating extra BRs in the context of https://github.com/ueberdosis/tiptap/issues/1500 (preventing empty paragraphs and lines from "collapsing" in serialized HTML, so that the HTML looks identical to ProseMirror's rendered state) and also noticed this Google Docs behavior. Here is how various editors handle empty paragraphs / lines (all in Mac Firefox, today):

Empty paragraph

To ensure the selection boundary is where I think it is, I always copy/paste an empty paragraph surrounded by non-empty paragraphs, i.e., <p>A</p> <p></p> <p>B</p>.

| | Google Docs | Notion | Ckeditor 4.18.0 [here](https://www.commoncurriculum.com/) | prosemirror-view 1.36.0 (via Tiptap 2.9.1) |
| - | - | - | - | - |
| Copy; empty p is rendered in clipboard HTML as: | Loose `<br />` | Nothing | `<p><br></p>` | `<p></p>` |
| Paste '' from GDocs (`<p>A</p> <br /> <p>B</p>`); empty p becomes: | ✅ (empty p) | Nothing | ✅ | p with 2 empty lines |
| Paste '' from Ckeditor (`<p>A</p> <p><br></p> <p>B</p>`); empty p becomes: | Nothing | Nothing | ✅ | ✅ |
| Paste '' from ProseMirror (`<p>A</p> <p></p> <p>B</p>`); empty p becomes: | Nothing | Nothing | ✅ | ✅ |
| Serialize to HTML | | | `<p>&nbsp;</p>` | `<p></p>` |

Empty line at end of paragraph

A paragraph that ends in a line break (<br />). E.g. a paragraph that in plain text is "First line\n".

I again copy/paste the paragraph surrounded by non-empty paragraphs.

| | Google Docs | Notion | Ckeditor 4.18.0 [here](https://www.commoncurriculum.com/) | prosemirror-view 1.36.0 (via Tiptap 2.9.1) |
| - | - | - | - | - |
| Copy; empty line is rendered in clipboard HTML as: | `<span ...><br /><br /></span>` * | Nothing | `<br><br>` * | `<br>` |
| Paste '' from GDocs; empty line becomes: | ✅ (empty line) | Paragraph separator (if you copy GDocs paragraphs without line breaks, Notion pastes them as lines in a single paragraph) | ✅ | 2 empty lines |
| Paste '' from Ckeditor; empty line becomes: | ✅ | Nothing | ✅ | ✅ |
| Paste '' from ProseMirror; empty line becomes: | Nothing | Nothing | Nothing | Nothing ** |
| Serialize to HTML | | | `<br />\n&nbsp;` | `<br>` |

* If you don't include the subsequent paragraph in your selection, there is only one BR.

** This might be an issue with my test setup or Tiptap's HardBreak extension.

TLDR

  1. There's no way to make ProseMirror play nicely with all of these when copy-pasting.
  2. GDocs and Ckeditor are liberal with BRs in empty spaces, ProseMirror is more conservative, and Notion doesn't try.
  3. GDocs has the "correct" number of BRs in both scenarios - the same as Ckeditor, which pastes into ProseMirror correctly - but does other weird things (loose BR without P / extra span) that foil ProseMirror's strip-trailing-BR logic.
  4. In serialized HTML, Ckeditor props up empty paragraphs/lines with &nbsp;, then strips them out when re-parsing the HTML. ProseMirror doesn't do this, but it's easy enough to wrap Tiptap's getHTML/setContent with some code that does (perhaps using <br /> instead of nbsp). I intend to work around https://github.com/ueberdosis/tiptap/issues/1500 that way in our app.
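
The wrapping idea in point 4 could be sketched roughly as follows. This is an assumption-laden illustration, not the code used in any app: the function names and the `data-filler` marker attribute are hypothetical, it works on plain HTML strings, and it only handles entirely empty paragraphs, not empty lines at the end of a paragraph.

```typescript
// Hypothetical marker so the filler can be told apart from a real hard break.
const FILLER = '<br data-filler="true">';

// After serializing: prop up every empty <p> (with or without attributes)
// so it does not collapse when the HTML is displayed elsewhere.
function propUpEmptyParagraphs(html: string): string {
  return html.replace(/<p(\s[^>]*)?><\/p>/gi, `<p$1>${FILLER}</p>`);
}

// Before re-parsing: remove the fillers again, leaving real breaks alone.
function stripFillers(html: string): string {
  return html.split(FILLER).join("");
}
```

With Tiptap, this would presumably be applied as `propUpEmptyParagraphs(editor.getHTML())` on the way out and `stripFillers(html)` before handing content to `setContent`. Marking the filler, instead of using a bare `<br />`, keeps the round trip from swallowing a paragraph that really contains a single hard break.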
dominikklein commented 2025-06-13 16:19:32 +02:00 (Migrated from github.com)

@marijnh After some playing around and a deeper look, I now understand what you meant. In the end, it's a difficult situation. I would say the source of the clipboard content is already wrong, since they are not doing it in a "correct" way.

So the only workaround would be to remove this stuff beforehand, but then you can never be sure whether it was a new line the user intended or not.

Reference
prosemirror/prosemirror#1511