Skip to content

Replace hand-rolled MarkdownToHTML with goldmark#415

Open
jeremy wants to merge 1 commit intomainfrom
goldmark
Open

Replace hand-rolled MarkdownToHTML with goldmark#415
jeremy wants to merge 1 commit intomainfrom
goldmark

Conversation

@jeremy
Copy link
Copy Markdown
Member

@jeremy jeremy commented Apr 9, 2026

Summary

  • Replace the ~350-line regex pipeline (convertInline, placeholder extraction/restoration,
    line-by-line parser) with goldmark configured for Trix editor compatibility via three custom
    components: trixTransformer (tight lists, soft→hard breaks, paragraph spacing),
    trixRenderer (compact blockquotes, HTML escaping, TrixBreak), and escapedAtParser
    (\@ mention suppression)
  • Fix HTMLToMarkdown round-trip fidelity: blockquote and list handlers now normalize <br>
    before splitting, so multiline blockquotes and list continuations survive the edit loop.
    Also fix multiline regex matching ((?i)(?is)) and trim trailing newlines from code
    fence content
  • Promote goldmark from indirect to direct dependency (v1.7.13, already present via glamour)

Net: -433/+452 lines across 3 files. The hand-rolled parser shrinks by ~300 lines; goldmark
components add ~200; tests add ~200 with exact round-trip assertions.

Behavioral changes

  • Lazy list continuation (CommonMark §5.4): - Two\nAfter keeps After inside the list
    item instead of starting a new paragraph. Rare in CLI input; CommonMark-correct.
  • Setext headings: Text\n--- is now an h2 (CommonMark), not paragraph + hr.
  • Escaped % in URLs: \% produces % not %25 (CommonMark-correct).
  • Code blocks: trailing \n inside <code> (goldmark standard fenced code output).
  • Consecutive lines: soft line break (\n) instead of space join — HTML renders identically.

Test plan

  • bin/ci green (all checks pass; tidy-check requires committed state)
  • Existing TestMarkdownToHTML / TestMarkdownToHTMLBackslashEscapes expectations updated
  • New tests: TestMarkdownToHTMLBackslashAtCounts (4 backslash counts),
    TestMarkdownToHTMLMultiParagraphBlockquote (single/multi/multi-para),
    TestMarkdownToHTMLRawHTMLBlock (single/multiline script),
    TestHTMLToMarkdownMultilineBlockquote, TestHTMLToMarkdownMultilineParagraph,
    TestHTMLToMarkdownCodeFenceNewline
  • TestEditLoopRoundTrip with exact assertions: blockquotes (3 variants), lists (plain +
    continuation), code fences, headings, mixed content
  • Verify edit loop in basecamp todos create / basecamp messages create against a real
    Basecamp account — created todo with heading, bold/italic, list with continuation,
    multiline blockquote, and code fence; server HTML → markdown → HTML → markdown
    round-tripped stably with all structure preserved

Copilot AI review requested due to automatic review settings April 9, 2026 20:38
@github-actions github-actions bot added tests Tests (unit and e2e) deps enhancement New feature or request labels Apr 9, 2026
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR replaces the custom Markdown→HTML conversion pipeline in internal/richtext with a goldmark-based converter configured for Basecamp/Trix-compatible output, and updates HTML→Markdown handling to improve edit-loop round-trip fidelity (notably for multiline blockquotes, list continuations, and code fences).

Changes:

  • Swap the hand-rolled MarkdownToHTML parser for a goldmark converter with custom AST transformer/renderer and a \@ mention-suppression inline parser.
  • Improve HTMLToMarkdown normalization for blockquotes, lists, and code fences to preserve multiline structure across round-trips.
  • Add/adjust richtext tests and promote goldmark to a direct dependency in go.mod.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
internal/richtext/richtext.go Replaces Markdown conversion with goldmark + custom Trix components; adjusts HTML→Markdown normalization helpers.
internal/richtext/richtext_test.go Updates expected HTML and adds targeted round-trip and edge-case tests for the new behavior.
go.mod Adds github.com/yuin/goldmark as a direct requirement (was previously indirect).

Tip

If you aren't ready for review, convert to a draft PR.
Click "Convert to draft" or run gh pr ready --undo.
Click "Ready for review" or run gh pr ready to reengage.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 3 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="internal/richtext/richtext.go">

<violation number="1" location="internal/richtext/richtext.go:613">
P1: `reStripP` regex matches `<pre>`, `<progress>`, and any other tag starting with `p`, not just `<p>`. The pattern `</?p[^>]*>` needs a word boundary after `p` — e.g., require the next character to be `\s` or `>` so that `<pre>` is excluded.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b053614b71

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6574c73bcf

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Copilot AI review requested due to automatic review settings April 9, 2026 21:17
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d73e8155c2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5d8fca04b3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Copilot AI review requested due to automatic review settings April 10, 2026 17:06
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 271de640f1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Replace the ~350-line regex pipeline (convertInline, placeholder
extraction/restoration, line-by-line parser) with goldmark configured
for Trix editor compatibility via three custom components:

- trixTransformer: forces tight lists, converts soft breaks to hard
  inside list items and blockquotes, inserts <br> between
  blank-line-separated blocks for Trix paragraph spacing
- trixRenderer: compact blockquote output, HTML escaping for raw
  HTML blocks (vs <!-- raw HTML omitted -->), TrixBreak and
  EscapedAt node rendering
- escapedAtParser: intercepts \@ before goldmark's standard escape
  handling to preserve mention-suppression syntax

HTMLToMarkdown gains <br>-aware round-trip support: blockquote and
list handlers now normalize <br> tags before splitting, so multiline
blockquotes and list continuations survive the edit loop
(MarkdownToHTML → HTMLToMarkdown) faithfully. Also fixes multiline
blockquote/paragraph regex matching ((?i) → (?is)) and trims
trailing newlines from code fence content.

Promotes goldmark from indirect to direct dependency (v1.7.13, already
present via glamour).
Copilot AI review requested due to automatic review settings April 10, 2026 19:57
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +659 to +686
for j < len(html) && depth > 0 {
// Decrement for any list close tag (handles mixed <ul>/<ol> nesting)
if hasPrefixFold(html[j:], "</ul>") || hasPrefixFold(html[j:], "</ol>") {
closeLen := 5 // len("</ul>") == len("</ol>")
depth--
if depth == 0 {
inner := html[contentStart:j]
var md string
if tag == "ul" {
md = convertULInner(inner)
} else {
md = convertOLInner(inner)
}
s := result.String()
if len(s) > 0 && s[len(s)-1] != '\n' {
result.WriteByte('\n')
}
result.WriteString(md + "\n\n")
j += closeLen
break
}
j += closeLen
} else if loc := reListOpen.FindStringSubmatchIndex(html[j:]); loc != nil && loc[0] == 0 {
depth++
j += loc[1]
} else {
j++
}
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In replaceBalancedListBlocks(), the inner scanning loop increments j one byte at a time and calls reListOpen.FindStringSubmatchIndex(html[j:]) on most iterations. This makes the list conversion potentially O(n²) on larger HTML bodies (especially when there are few tags, so the regex runs for many bytes). Consider scanning by jumping to the next '<' (e.g., via strings.IndexByte) and only then checking for open/close tags, or doing a single pass tokenization to avoid repeated regex calls per byte.

Copilot uses AI. Check for mistakes.
Comment on lines +672 to +676
s := result.String()
if len(s) > 0 && s[len(s)-1] != '\n' {
result.WriteByte('\n')
}
result.WriteString(md + "\n\n")
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replaceBalancedListBlocks() calls result.String() inside the main conversion loop to check whether the last written byte is a newline. strings.Builder.String() materializes the full accumulated string, which is avoidable extra work and can become costly for large inputs. Consider tracking whether the builder currently ends with a newline via a small state variable (or by maintaining a separate lastByte as you write) instead of converting the whole builder to a string each time.

Copilot uses AI. Check for mistakes.
Comment on lines +614 to +630
for j < len(html) && depth > 0 {
if hasPrefixFold(html[j:], "</li>") {
depth--
if depth == 0 {
content := html[contentStart:j]
content = replaceBalancedListBlocks(content)
items = append(items, content)
j += 5
break
}
j += 5
} else if loc := reLIOpen.FindStringIndex(html[j:]); loc != nil && loc[0] == 0 {
depth++
j += loc[1]
} else {
j++
}
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extractListItems() uses a byte-by-byte scan (j++) and runs reLIOpen.FindStringIndex(html[j:]) repeatedly to detect nested

  • tags. This can lead to quadratic behavior on large list items. Consider advancing j by searching for the next '<' and then checking for <li / </li> prefixes (case-insensitively), reserving the regex for the initial “find next
  • ” step only.

  • Copilot uses AI. Check for mistakes.
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

    Labels

    deps enhancement New feature or request tests Tests (unit and e2e)

    Projects

    None yet

    Development

    Successfully merging this pull request may close these issues.

    2 participants