fix: preserve mermaid diagram text from SVGs during scraping (#1043) #1845

Open
hafezparast wants to merge 1 commit into unclecode:develop from hafezparast:fix/maysam-mermaid-svg-text-1043

@hafezparast

Summary

  • Fixes [Bug]: Missing Mermaid Flowcharts #1043
  • Mermaid diagrams rendered as SVGs were completely stripped during HTML cleaning, losing all diagram text content (node labels, edge labels, class names, etc.)
  • Now detects SVGs with id="mermaid-*", extracts text from .nodeLabel, .label span, .edgeLabel span elements, and replaces the SVG with a fenced language-mermaid code block

Changes

  • crawl4ai/content_scraping_strategy.py: Added mermaid SVG detection and text extraction in _scrap() before tag cleanup

How it works

  1. Finds <svg> elements with id starting with mermaid-
  2. Reads aria-roledescription for diagram type (flowchart, class, sequence, etc.)
  3. Extracts text from node/edge label elements
  4. Replaces SVG with <pre><code class="language-mermaid"> containing diagram type and labels
  5. Non-mermaid SVGs are unaffected
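
As a rough illustration of the steps above, here is a minimal standard-library sketch. It is not the code in this PR (which lives in `_scrap()` and is not shown on this page); the names `LabelCollector` and `replace_mermaid_svgs`, the regex-based SVG matching, and the order-preserving de-duplication of labels are all assumptions made for the example.

```python
import re
from html import escape
from html.parser import HTMLParser

# Assumption: label classes taken from the PR summary (.nodeLabel, .label, .edgeLabel).
LABEL_CLASSES = {"nodeLabel", "edgeLabel", "label"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}
# Matches only <svg> elements whose id starts with "mermaid-"; nested SVGs are not handled.
MERMAID_SVG = re.compile(
    r'<svg\b[^>]*\sid="mermaid-[^"]*"[^>]*>.*?</svg>', re.DOTALL | re.IGNORECASE
)


class LabelCollector(HTMLParser):
    """Hypothetical helper: gathers the diagram type and label text from one SVG."""

    def __init__(self):
        super().__init__()
        self.diagram_type = ""
        self.labels = []
        self._depth = 0  # nesting depth inside the outermost label element
        self._buf = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "svg":
            self.diagram_type = attrs.get("aria-roledescription") or ""
        if tag in VOID_TAGS:
            return  # void tags never get a matching end tag
        if self._depth:
            self._depth += 1
        elif LABEL_CLASSES & set((attrs.get("class") or "").split()):
            self._depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            text = " ".join(" ".join(self._buf).split())
            self._buf = []
            if text and text not in self.labels:  # assumed: drop duplicate labels
                self.labels.append(text)

    def handle_data(self, data):
        if self._depth:
            self._buf.append(data)


def replace_mermaid_svgs(html: str) -> str:
    """Replace each mermaid SVG with a <pre><code class="language-mermaid"> block."""

    def to_code_block(match):
        collector = LabelCollector()
        collector.feed(match.group(0))
        collector.close()
        lines = [collector.diagram_type] if collector.diagram_type else []
        lines += collector.labels
        body = escape("\n".join(lines))
        return f'<pre><code class="language-mermaid">{body}</code></pre>'

    return MERMAID_SVG.sub(to_code_block, html)
```

SVGs whose `id` does not start with `mermaid-` never match the pattern, so they pass through unchanged, which mirrors step 5.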

Test plan

  • New test suite: tests/test_issue_1043_mermaid_svg.py (17 tests)
  • Regression suite: 304/305 passing (1 pre-existing HuggingFace failure)
  • Tests cover: flowchart, class diagram, sequence diagram types
  • Tests cover: edge cases (empty SVG, no aria, malformed, duplicates, multiple diagrams)

Generated with Claude Code

…de#1043)

Mermaid diagrams rendered as SVGs were completely stripped during HTML
cleaning, losing all text content. Now detects SVGs with id="mermaid-*",
extracts node/edge labels, and replaces the SVG with a fenced mermaid
code block containing the diagram type and extracted text.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ntohidi
Collaborator

ntohidi commented Mar 22, 2026

it can be merged. will merge soon
