
ConvergentMethods/google-docs-api-fixtures


23 Ways the Google Docs API Will Silently Corrupt Your Document

If you've built anything that programmatically edits Google Docs, you've hit these. In most of these cases the API doesn't return an error or throw an exception. It accepts your batchUpdate request, returns 200, and silently produces wrong output.

This repository contains 23 reproducible mutation pairs — before state, API request, after state — demonstrating every major failure mode in the Google Docs batchUpdate API. Each was captured against a live Google Doc. Each is deterministically reproducible.

These are not edge cases. These are the normal operations: insert text, add a table, apply bold, create a list. The API makes every single one of them dangerous.


The Core Problem

The Google Docs API operates on absolute UTF-16 index positions. Every character, every structural element, every table cell boundary occupies a numbered position in a flat index space. When you insert 10 characters at position 50, every index after position 50 shifts by 10.

The API does not adjust subsequent requests in a batch. If you send a batchUpdate with three requests, and the first one shifts indices, the second and third requests are now pointing at the wrong locations. The API will happily execute them at those wrong locations. No error. No warning. Corrupted document.

This means:

  • Every insertion invalidates every subsequent index in the batch. You must compute cascading offsets yourself.
  • Every deletion shifts indices backward. Miss one and you delete the wrong paragraph.
  • Compound operations (insert table + fill cells) require reading the document between steps. The first step changes the index landscape in ways you cannot predict without re-reading.
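  • UTF-16 code units, not characters. Emoji and anything else outside the Basic Multilingual Plane (including the rarer CJK ideographs in the supplementary planes) consume 2 index positions (surrogate pairs). Common CJK characters sit inside the BMP and consume 1. Python's len() counts code points, so it gives the wrong number whenever a surrogate pair is present.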
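The index arithmetic in the bullets above can be sketched in a few lines of Python. This is a minimal illustration, not part of the fixtures; the helper names utf16_len and shift_after_insert are hypothetical.

```python
def utf16_len(text: str) -> int:
    """Return the number of UTF-16 code units in text."""
    # UTF-16-LE uses 2 bytes per code unit and writes no byte-order mark.
    return len(text.encode("utf-16-le")) // 2


def shift_after_insert(index: int, insert_at: int, inserted: str) -> int:
    """Return where an original index lands after text is inserted at insert_at."""
    if index < insert_at:
        return index  # positions before the insertion point do not move
    return index + utf16_len(inserted)


print(len("Done 🎉"), utf16_len("Done 🎉"))      # 6 7
print(shift_after_insert(60, 50, "0123456789"))  # 70
```

The first print shows the mismatch between Python's code-point count and the UTF-16 count for a string containing one emoji. The second reproduces the "insert 10 characters at position 50" case from the text.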

Fixture Structure

fixtures/
  01_plain_text.json              # 13 document fixtures (documents.get snapshots)
  02_heading_hierarchy.json
  ...
  13_bookmarks.json
  manifest.json                   # Document IDs for each fixture

  mutations/                      # 23 mutation pairs
    T1_insert_text_start/
      before.json                 # Full document state before mutation
      request.json                # Exact batchUpdate request(s) sent
      after.json                  # Full document state after mutation
      description.md              # What the mutation does and why it matters

    T2_insert_text_end/
    ...
    N3_delete_named_range/

  validation/                     # 8 multi-step validation scenarios
    insert_text_after_heading/
      compiled_request.json       # The request sequence
      after.json                  # Resulting document state
      result.txt                  # PASS/FAIL
    ...

Every mutation directory contains four files: the document state before, the exact API request, the document state after, and a description of what the operation does. You can diff before.json and after.json to see exactly what changed.
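One way to do that diff is to flatten each snapshot into one line per text run and compare the two lists. The sketch below assumes the snapshots are ordinary documents.get responses and only walks top-level body paragraphs (text inside tables, headers, and footnotes is ignored). The function names are hypothetical.

```python
import difflib


def text_runs(document):
    """List 'start-end: text' for every text run in top-level body paragraphs."""
    runs = []
    for element in document.get("body", {}).get("content", []):
        for part in element.get("paragraph", {}).get("elements", []):
            run = part.get("textRun")
            if run is not None:
                start = part.get("startIndex", 0)  # zero values may be omitted
                runs.append(f"{start}-{part['endIndex']}: {run['content']!r}")
    return runs


def diff_snapshots(before, after):
    """Return unified-diff lines between two document snapshots."""
    return list(
        difflib.unified_diff(
            text_runs(before), text_runs(after),
            "before.json", "after.json", lineterm="",
        )
    )


# Usage sketch (paths follow the layout shown above):
#   import json
#   before = json.load(open("fixtures/mutations/T1_insert_text_start/before.json"))
#   after = json.load(open("fixtures/mutations/T1_insert_text_start/after.json"))
#   print("\n".join(diff_snapshots(before, after)))
```

Because each line carries its index range, a shifted but otherwise unchanged run shows up as a changed line, which makes index drift visible.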


All 23 Failure Modes

Text Operations

| ID | Operation | What you want | What goes wrong | Severity |
|---|---|---|---|---|
| T1 | Insert text at document start | Add a paragraph at the top | All subsequent indices in the batch shift by len(text). Second request in batch targets wrong position. | Data loss |
| T2 | Insert text at document end | Append a paragraph | Must find the correct end position. The document's endIndex is not a valid insertion point: the trailing \n occupies it. Inserting at endIndex fails silently or corrupts. | Corruption |
| T3 | Insert after a heading | Add text under "Revenue Analysis" | Requires walking the document tree to find the heading's endIndex. No semantic addressing, only raw indices. Wrong index = text lands in wrong section. | Corruption |
| T4 | Insert between paragraphs | Add a paragraph between two existing ones | Same index arithmetic problem as T1-T3. Must count paragraphs manually (skip the implicit sectionBreak at index 0). | Corruption |
| T5 | Replace a section | Replace body text under a heading | Two-step: delete range, then insert. The delete shifts all subsequent indices. The insert must use the post-delete index. Get it wrong and you overwrite the wrong section. | Data loss |
| T6 | Delete a paragraph | Remove a paragraph | Cannot delete the sectionBreak (index 0-1) or the final trailing \n. Attempting either fails silently. All subsequent indices shift backward; miss one and the next operation corrupts. | Data loss |
| T7 | Replace all text | Find and replace | The only safe text operation. The API handles index arithmetic internally. Zero complexity. Everything else should be this simple. | Cosmetic |

Formatting Operations

| ID | Operation | What you want | What goes wrong | Severity |
|---|---|---|---|---|
| F1 | Apply bold | Bold a phrase | The fields parameter is a required field mask. Omit it and the API silently clears every other style property on the range: font, size, color, everything. The API returns 200. | Data loss |
| F2 | Change heading level | Promote/demote a heading | Google auto-generates a headingId that you cannot predict or control. If your code references heading IDs downstream, they're now stale. | Corruption |
| F4 | Change font | Set font family and size | Range must exclude the trailing \n of the paragraph. Include it and you bleed the style into the next paragraph. No error. | Corruption |
| F5 | Add hyperlink | Link text to a URL | Google silently auto-applies underline and blue foreground color (rgb 0.067, 0.333, 0.8). If you also apply these manually, you double-apply and create a styling conflict that survives link removal. | Cosmetic |

Structural Operations

| ID | Operation | What you want | What goes wrong | Severity |
|---|---|---|---|---|
| S1 | Insert table | Add a table at a location | Insertion index must be strictly less than the segment's endIndex. Insert at the end → undocumented failure. A 3x3 table with 4-character cells consumes 60 index positions. Miscalculate and every subsequent operation in the batch corrupts. | Data loss |
| S2 | Add table row | Append a row to a table | Must know the table's startIndex (not the row's), the row index, and the column index. A new row consumes 1 + cols * 2 indices. These cascade into everything after the table. | Corruption |
| S5 | Insert bullet list | Create a bulleted list | Two-step operation: insert text first, then apply bullet formatting with createParagraphBullets. Send them in the wrong order and the bullet range points at nonexistent text. | Corruption |
| S6 | Insert numbered list | Create a numbered list | Same two-step problem as S5. Numbered list preset is NUMBERED_DECIMAL_ALPHA_ROMAN. Not discoverable from the API; you need the docs. | Corruption |
| S7 | Convert paragraphs to list | Bullet existing paragraphs | Must calculate the exact startIndex and endIndex spanning all target paragraphs. Off by one and you either miss a paragraph or bullet the wrong one. | Cosmetic |
| S8 | Insert page break | Add a page break | A page break creates a two-element paragraph: a pageBreak element + a trailing textRun containing \n. Consumes 2 index positions, not 1. Downstream index arithmetic is wrong if you assume 1. | Corruption |

Object Operations

| ID | Operation | What you want | What goes wrong | Severity |
|---|---|---|---|---|
| O1 | Insert image | Embed an image | Image consumes exactly 1 index position regardless of file size. But Google re-hosts the image, so sourceUri and contentUri diverge silently. If you track images by URI, your references break. | Cosmetic |
| O2 | Create header/footer | Add page header or footer | Creates an entirely new index segment with its own independent index space starting at 0 (which is omitted from the JSON due to proto3 default value suppression). If you don't handle missing startIndex, you get KeyError. | Corruption |
| O4 | Insert footnote | Add a footnote | Consumes 1 index in the body. But creates a new footnote segment with its own index space. Compound operation: you must re-read the document after creation to get the footnote segment ID before you can insert content into it. | Corruption |

Named Range Operations

| ID | Operation | What you want | What goes wrong | Severity |
|---|---|---|---|---|
| N1 | Create named range | Bookmark a section | Named ranges store absolute index positions. Any insertion or deletion before the range silently invalidates it. The range doesn't move. Your named range now points at the wrong text. | Data loss |
| N2 | Replace named range content | Update a bookmarked section | Two-step: delete old content, insert new. The delete invalidates the named range entirely. You must re-create it if you need it again. The API does not warn you. | Data loss |
| N3 | Delete named range | Remove a bookmark | The only safe named range operation. No index impact. | |

The 5 Worst Offenders

These are the failure modes most likely to corrupt real documents in production.

1. The Silent Style Wipe (F1)

You want to bold a phrase. You send:

{
  "updateTextStyle": {
    "range": {"startIndex": 143, "endIndex": 146},
    "textStyle": {"bold": true}
  }
}

The API returns 200. The text is now bold. It is also now in the default font, default size, default color. Every other style property on that range has been silently cleared.

The fix is a fields parameter:

{
  "updateTextStyle": {
    "range": {"startIndex": 143, "endIndex": 146},
    "textStyle": {"bold": true},
    "fields": "bold"
  }
}

The fields parameter is a field mask that tells the API which properties to modify. Omit it and the API interprets that as "set bold to true, set everything else to default." This is documented in a single paragraph buried in the field masks guide. It is not mentioned in the updateTextStyle reference.
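A small helper can make it impossible to forget the mask by deriving it from the properties being set. This is a sketch with a hypothetical function name, and it assumes only top-level textStyle properties (bold, italic, underline, and so on) are passed.

```python
def text_style_request(start_index, end_index, **style):
    """Build an updateTextStyle request whose field mask matches the style keys."""
    if not style:
        raise ValueError("at least one style property is required")
    return {
        "updateTextStyle": {
            "range": {"startIndex": start_index, "endIndex": end_index},
            "textStyle": dict(style),
            # The mask lists exactly the properties supplied, nothing more.
            "fields": ",".join(sorted(style)),
        }
    }


request = text_style_request(143, 146, bold=True)
print(request["updateTextStyle"]["fields"])  # bold
```

Calling text_style_request(143, 146, bold=True) produces the corrected request shown above.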

Fixture: fixtures/mutations/F1_apply_bold/

2. The Off-By-One Table Insertion (S1)

You want to insert a table at the end of a section. You calculate the insertion index as the section's endIndex. You send:

{
  "insertTable": {
    "location": {"index": 181},
    "rows": 2,
    "columns": 3
  }
}

If that index equals the segment's endIndex, the API returns: "Index 181 must be less than the end index of the referenced segment, 181." This off-by-one behavior is not documented. You must insert at endIndex - 1 — before the trailing \n, not at the segment boundary.

But the real danger is what happens after the table is inserted. A 2x3 empty table consumes approximately 19 index positions (table start + row markers + cell markers + cell newlines). A 3x3 table with 4-character cells consumes 60. Every operation after the table in the same batch is now pointing at an index that is 19-60 positions too low. The API executes them anyway.
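Rather than hard-coding table sizes, the shift can be measured from two documents.get snapshots, and the last valid insertion index can be derived from the body itself. The sketch below assumes both snapshots are body-level documents.get responses; the function names are hypothetical.

```python
def body_end_index(document):
    """Return the endIndex of the last structural element in the body."""
    return document["body"]["content"][-1]["endIndex"]


def last_insertable_index(document):
    """Return the highest body index that accepts an insertion."""
    # The final position holds the trailing newline, so step back by one.
    return body_end_index(document) - 1


def measured_shift(before, after):
    """Return how many index positions a mutation added to the body."""
    return body_end_index(after) - body_end_index(before)
```

The measured shift applies only to positions after the insertion point, so it is the offset to add to any later request in the same batch that was computed against the pre-insert snapshot.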

Fixture: fixtures/mutations/S1_insert_table/

3. The Cascading Section Replace (T5)

You want to replace the body text under a heading. This is a two-step operation in a single batch:

[
  {"deleteContentRange": {"range": {"startIndex": 116, "endIndex": 181}}},
  {"insertText": {"location": {"index": 116}, "text": "REPLACEMENT SECTION CONTENT.\n"}}
]

The delete removes 65 characters (181 - 116). Every index at or after position 181 just shifted backward by 65. The insert must use the post-delete index (116 is correct here because we're inserting at the same position we deleted from). The replacement text is 29 characters long, including its trailing \n. So if you had a third operation in this batch targeting, say, the "Conclusion" heading at its original position, that position is now 36 characters earlier than your code thinks it is (65 deleted - 29 inserted = 36 net shift). The API will execute your third operation at the wrong location.
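The bookkeeping for this batch can be sketched as a function that follows one original index through each request in order. It is an illustration only: track_index is a hypothetical name, the original index 200 is a made-up position for the later heading, and the sketch handles just deleteContentRange and insertText in the document body.

```python
def utf16_len(text):
    """Return the number of UTF-16 code units in text."""
    return len(text.encode("utf-16-le")) // 2


def track_index(requests, index):
    """Follow one original body index through requests applied in order."""
    for request in requests:
        if "deleteContentRange" in request:
            rng = request["deleteContentRange"]["range"]
            start, end = rng.get("startIndex", 0), rng["endIndex"]
            if index >= end:
                index -= end - start   # content after the range moves back
            elif index > start:
                index = start          # the position itself was deleted
        elif "insertText" in request:
            insert = request["insertText"]
            if index >= insert["location"].get("index", 0):
                index += utf16_len(insert["text"])
    return index


batch = [
    {"deleteContentRange": {"range": {"startIndex": 116, "endIndex": 181}}},
    {"insertText": {"location": {"index": 116},
                    "text": "REPLACEMENT SECTION CONTENT.\n"}},
]
print(track_index(batch, 200))  # 164, i.e. 36 positions earlier
```

An alternative that avoids tracking altogether is to order independent edits from the highest index to the lowest, so that each edit only shifts positions that have already been handled.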

Fixture: fixtures/mutations/T5_replace_section/

4. The Proto3 Zero-Value Trap (O2)

When you create a header or footer, the API creates a new segment with its own index space. The segment's startIndex is 0. But due to proto3 serialization rules, zero-valued fields are omitted from the JSON response. The startIndex field simply does not appear.

{
  "headers": {
    "kix.nznd4d573jt5": {
      "content": [
        {
          "endIndex": 1,
          "paragraph": { ... }
        }
      ]
    }
  }
}

There is no startIndex on that paragraph element. If your code does element["startIndex"], you get a KeyError. If your code does element.get("startIndex") without a default, you get None and your index arithmetic produces garbage. The correct pattern is element.get("startIndex", 0) — every time, for every element, in every segment. This applies to headers, footers, footnotes, and any future segment type.

This is standard proto3 behavior, but Google's API documentation does not mention it in the context of the Docs API. You discover it when your code crashes on the first document with a header.
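Centralizing the default in one accessor keeps the pattern from being forgotten at individual call sites. The sketch below uses hypothetical helper names and assumes a documents.get response shaped like the excerpt above.

```python
def start_index(element):
    """Return an element's startIndex, treating an omitted field as 0."""
    return element.get("startIndex", 0)


def span(element):
    """Return (startIndex, endIndex) for a structural element."""
    return start_index(element), element["endIndex"]


def header_spans(document):
    """Map each header ID to the index spans of its structural elements."""
    return {
        header_id: [span(element) for element in header.get("content", [])]
        for header_id, header in document.get("headers", {}).items()
    }


doc = {"headers": {"kix.nznd4d573jt5": {"content": [{"endIndex": 1, "paragraph": {}}]}}}
print(header_spans(doc))  # {'kix.nznd4d573jt5': [(0, 1)]}
```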

Fixture: fixtures/mutations/O2_create_header/

5. Named Range Index Drift (N1 + N2)

Named ranges store absolute index positions. They do not update when the document changes.

Create a named range spanning indices 14-99:

{"createNamedRange": {"name": "executive_summary", "range": {"startIndex": 14, "endIndex": 99}}}

Now insert a paragraph at the beginning of the document (index 1). The document content shifts forward by 20 characters. Your named range still says 14-99. It now points at different text than what you bookmarked. The API does not adjust it. The API does not warn you. The next time you read the named range and operate on its indices, you are operating on the wrong content.

It gets worse with N2 (replace named range content): the delete-and-insert invalidates the named range entirely. The range metadata still exists, but its indices are stale. If you need the range again, you must delete it and re-create it with the new indices. The API does not do this for you.
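The delete-and-re-create step can be sketched as follows, using the 14-99 range and the 20-character insertion from the example above. The helper names are hypothetical. Note that deleteNamedRange with a name removes every named range that shares that name.

```python
def shifted_range(rng, insert_at, inserted_units):
    """Return the range an insertion of inserted_units at insert_at should map to."""
    start, end = rng.get("startIndex", 0), rng["endIndex"]
    if insert_at <= start:
        start += inserted_units   # insertion before the range moves both ends
        end += inserted_units
    elif insert_at < end:
        end += inserted_units     # insertion inside the range grows it
    return {"startIndex": start, "endIndex": end}


def recreate_named_range_requests(name, new_range):
    """Build the delete + create pair that replaces a stale named range."""
    return [
        {"deleteNamedRange": {"name": name}},
        {"createNamedRange": {"name": name, "range": new_range}},
    ]


fresh = shifted_range({"startIndex": 14, "endIndex": 99}, insert_at=1, inserted_units=20)
print(fresh)  # {'startIndex': 34, 'endIndex': 119}
requests = recreate_named_range_requests("executive_summary", fresh)
```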

Fixtures: fixtures/mutations/N1_create_named_range/ and fixtures/mutations/N2_replace_named_range_content/


How to Run These Fixtures

Prerequisites

  1. A Google Cloud project with the Google Docs API enabled
  2. An OAuth 2.0 Client ID (Desktop application type)
  3. Python 3.12+
  4. The credentials.json file from Google Cloud Console

Setup

git clone https://github.com/ConvergentMethods/google-docs-api-fixtures.git
cd google-docs-api-fixtures
pip install google-auth google-auth-oauthlib google-api-python-client

Reproducing the Mutations

Each mutation directory contains the exact batchUpdate request that was sent. To reproduce:

  1. Create a test document using the Google Docs API
  2. Apply the request.json contents via batchUpdate
  3. Compare the result with after.json

The before.json file shows the document state before the mutation, so you can reconstruct the starting document. The description.md file explains what the operation does and what to watch for.
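Steps 2 and 3 might look like the sketch below. It rests on assumptions: the exact shape of request.json is not specified here, so the loader accepts either a bare list of requests or an object with a "requests" key, and service is an authorized Docs client that you build yourself (for example with googleapiclient.discovery.build("docs", "v1", credentials=creds)). The function names are hypothetical.

```python
import json
from pathlib import Path


def load_requests(path):
    """Load batchUpdate requests from a request.json file (shape is assumed)."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, dict):
        # Either {"requests": [...]} or a single request object.
        data = data.get("requests", [data])
    return data


def apply_mutation(service, document_id, mutation_dir):
    """Send a fixture's requests to a document and return its new state."""
    requests = load_requests(Path(mutation_dir) / "request.json")
    service.documents().batchUpdate(
        documentId=document_id, body={"requests": requests}
    ).execute()
    # Re-read the document so the result can be compared with after.json.
    return service.documents().get(documentId=document_id).execute()
```

The returned dictionary has the same shape as after.json. Document IDs and revision metadata will differ between your test document and the fixture, so compare structure and indices rather than the whole file.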

Exploring the Fixtures

The document fixtures (01_plain_text.json through 13_bookmarks.json) are raw documents.get responses from the Google Docs API. They cover:

  • Plain text, headings, inline formatting
  • Bullet lists, numbered lists, nested lists
  • Tables (simple and complex)
  • Images, headers, footers, footnotes
  • Named ranges, bookmarks, page breaks
  • A kitchen-sink document combining all of the above

These are useful as reference material for understanding how Google represents document structure in JSON — something the official documentation is remarkably vague about.


Why These Exist

We built Arezzo, a deterministic compiler for Google Docs API operations. It compiles semantic intent ("insert a paragraph after the Revenue heading") into correct batchUpdate request sequences with proper UTF-16 index arithmetic, cascading offset tracking, and OT-compatible mutation ordering.

These fixtures are the empirical foundation that Arezzo was built on. They exist because the only way to understand how the Google Docs API actually behaves is to send requests and inspect the results. The documentation tells you what the API accepts. These fixtures show you what it actually does.


License

MIT - Convergent Methods, LLC
