Textagent
diff --git a/README.md b/README.md
Lines changed: 2 additions & 1 deletion
diff --git a/changelogs/CHANGELOG-semantic-search.md b/changelogs/CHANGELOG-semantic-search.md
Lines changed: 52 additions & 0 deletions
@@ -43,7 +43,7 @@
 | **🎮 Game Builder** | `{{@Game:}}` tag — AI-generated games (Canvas 2D / Three.js / P5.js) or instant pre-built games via `@prebuilt:` field (chess, snake, shooter, pong, breakout, maths quiz, hiragana, kana master); engine selector pills; per-card model picker; CDN URL normalizer for CSP compliance; auto model-ready check before generation; 📋 Import button for pasting/uploading external HTML game code with source viewer; 📥 Export as standalone HTML; ⛶ fullscreen; single-line field parsing; "Games for Kids" template with 8 playable games |
 | **🐧 Linux Terminal** | `{{Linux:}}` tag — two modes: (1) Terminal mode opens full Debian Linux ([WebVM](https://webvm.io)) in new window with `Packages:` field; (2) Compile & Run mode (`Language:` + `Script:`) compiles/executes 25+ languages (C++, Rust, Go, Java, Python, TypeScript, Kotlin, Scala…) via [Judge0 CE](https://ce.judge0.com) with inline output, execution time & memory stats |
 | **❓ Help Mode** | Interactive learning mode — click ❓ Help to highlight all buttons, click any button for description + keyboard shortcut + animated demo video; 50% screen demo panel with fullscreen expand; 16 dedicated demo videos mapped to every toolbar button |
-| **🧠 Context Memory** | `{{@Memory:}}` tag for workspace intelligence — SQLite FTS5 full-text search with heading-aware chunking (~1500 chars/chunk); three storage modes: browser-only (IndexedDB), disk workspace (`.textagent/memory.db`), external folders (IndexedDB); `@use: workspace, my-docs` in AI/Think/Agent tags for multi-source context retrieval; Memory Selector dropdown on AI/Think/Agent cards; amber-accented Memory card with Folder/Files/Rebuild buttons + stats; auto-discovery of workspace files; `Use: none` opt-out; reuses existing sql.js WASM (zero bundle increase) |
+| **🧠 Context Memory** | `{{@Memory:}}` tag for workspace intelligence — **hybrid search**: SQLite FTS5 keyword search (40%) + EmbeddingGemma 300M semantic cosine similarity (60%) with heading-aware chunking (~1500 chars/chunk); 💎 Semantic toggle auto-downloads embedding model (~150MB WASM) on first file attach; three storage modes: browser-only (IndexedDB), disk workspace (`.textagent/memory.db`), external folders (IndexedDB); **file conversion**: binary formats (DOCX, XLSX, XLS, Numbers, PDF) auto-converted to markdown via Mammoth.js/SheetJS/PDF.js before indexing; `@use: workspace, my-docs` in AI/Think/Agent tags for multi-source context retrieval; Memory Selector dropdown on AI/Think/Agent cards; amber-accented Memory card with Folder/Files/Rebuild buttons + stats; auto-discovery of workspace files; `Use: none` opt-out; reuses existing sql.js WASM (zero bundle increase) |
 | **✉️ Email to Self** | Send documents directly to your inbox from the share modal — email address input with `.md` file attached + share link; powered by Google Apps Script (free, 100 emails/day); Cloudflare Turnstile CAPTCHA verification; dual rate limiting (100/day global + 7/day per recipient); loading state + success/error feedback; email persisted in localStorage; zero third-party dependencies |
 | **💾 Disk Workspace** | Folder-backed storage via File System Access API — "Open Folder" in sidebar header; `.md` files read/written directly to disk; `.textagent/workspace.json` manifest; debounced autosave ("💾 Saved to disk" indicator); refresh from disk for external edits; disconnect to revert to localStorage; auto-reconnect on reload via IndexedDB handles; unified action modal for rename/duplicate/delete with confirmation; Chromium-only (hidden in unsupported browsers) |
 | **📈 Finance Dashboard** | Stock/crypto/index dashboard templates with live TradingView charts; dynamic grid via `data-var-prefix` (add/remove tickers in `@variables` table, grid auto-adjusts); configurable chart range (`1M`, `12M`, `36M`), interval (`D`, `W`, `M`), and EMA period (default 52); interactive 1M/1Y/3Y range + 52D/52W/52M EMA toggle buttons; `@variables` table persists after ⚡ Vars for re-editing; JS code block generates grid HTML from variables |
@@ -539,6 +539,7 @@ TextAgent has undergone significant evolution since its inception. What started
 | Date | Commits | Feature / Update |
 |------|---------|-----------------:|
 | **2026-03-23** | | 📷 **OCR Camera Capture** — new 📷 camera button on `{{@OCR:}}` cards for live camera capture; `getUserMedia` with rear-camera preference (`facingMode: 'environment'`); modal overlay with live video feed, 📸 Capture → preview → 🔄 Retake / ✅ Use Photo flow; captured images stored as JPEG (0.85 quality, max 1280px) in `blockUploads` map; native `<input capture="environment">` fallback for browsers without `getUserMedia`; amber-themed modal CSS with dark/light support; Escape/overlay-click/✕ dismissal |
+| **2026-03-23** | | 💎 **Hybrid Semantic Search & File Conversion** — Memory indexer now supports hybrid search: FTS5 keyword matching (40%) + EmbeddingGemma 300M cosine similarity (60%); new `public/embedding-worker.js` Web Worker with WASM backend (WebGPU shader workaround); `memory_embeddings` table stores 256-dim float arrays per chunk; 💎 Semantic toggle on Memory card auto-downloads model (~150MB) when files are attached; binary file conversion integrated — DOCX (Mammoth.js), XLSX/XLS/Numbers (SheetJS), PDF (PDF.js+OCR), CSV, HTML, JSON, XML auto-converted to markdown before chunking; `M.convertFileToMarkdown()` public API exposed in `file-converters.js`; `.numbers` Apple Numbers format added to import pipeline |
 | **2026-03-23** | | 🐛 **GLM-OCR Model Download Fix** — fixed GLM-OCR model failing to download; `glm_ocr` model type was not supported in Transformers.js `4.0.0-next.7`; upgraded to `4.0.0-next.8` which includes `glm_ocr` model class mapping; both `ai-worker-glm-ocr.js` and `public/ai-worker-glm-ocr.js` updated |
 | **2026-03-23** | | 🌐 **API Explorer Template** — comprehensive API Explorer template listing ALL 1400+ public APIs from [public-apis/public-apis](https://github.com/public-apis/public-apis) across 51 categories; each category includes working `{{API:}}` blocks for no-auth APIs (click-to-try GET requests) plus reference tables for auth-required APIs with Auth type, HTTPS, and CORS info; auto-generated from GitHub raw README via Node.js parser; new `api-explorer` template category with `bi-globe2` icon; template count 136→137+, categories 14→15 |
 | **2026-03-23** | | 🎨 **Read-Only UI Cleanup** — composer FAB and floating panel hidden in read-only mode (`body.editor-readonly`); agent panel and toggles hidden when both read-only AND header-hidden (`body.editor-readonly.header-hidden`); removed redundant Import button from header toolbar, mobile menu, and QAB (Upload/drag-and-drop dropzone covers same 8-format import functionality); updated Help Mode entry for Upload button |
 
@@ -0,0 +1,52 @@
+# Hybrid Semantic Search & File Conversion for Memory Indexer
+
+## Summary
+
+Adds hybrid semantic search to Context Memory (FTS5 keyword matching combined with EmbeddingGemma cosine similarity) and integrates the existing file-to-markdown conversion pipeline into the Memory indexer, so binary formats (DOCX, XLSX, XLS, Numbers, PDF) can be indexed.
+
+---
+
+## 1. Hybrid Semantic Search
+**Files:** `js/context-memory.js`, `public/embedding-worker.js` (new), `js/ai-docgen.js`
+
+**What:**
+- EmbeddingGemma 300M ONNX model (`textagent/embeddinggemma-300m-ONNX`) for generating 256-dim text embeddings
+- New embedding Web Worker (`public/embedding-worker.js`) with WASM backend (forced; WebGPU has shader issues with SimplifiedLayerNormalization)
+- Hybrid search combines FTS5 keyword search (40%) with semantic cosine similarity (60%)
+- `memory_embeddings` table stores per-chunk embeddings as JSON float arrays
+- 💎 Semantic toggle on Memory card: auto-downloads embedding model (~150MB) when files are attached, serves as status indicator and manual re-embed trigger
+- `enableSemanticSearch()`, `disableSemanticSearch()`, `getEmbeddingStatus()`, `reembedSource()` APIs
+
+**Impact:** Queries like "how does authentication work?" can now surface relevant chunks even when the exact keyword "authentication" is absent, because chunks are also ranked by embedding similarity to the query.
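
The 40/60 weighting described above can be illustrated with a minimal sketch. It is not the code in `js/context-memory.js`: the function names, the chunk shape, and the assumption that keyword scores are already scaled to 0..1 are all hypothetical, and 2-dimensional vectors stand in for the 256-dim embeddings to keep the example short.

```javascript
// Hypothetical sketch: names, chunk shape, and score normalization are assumptions.
const KEYWORD_WEIGHT = 0.4;  // FTS5 keyword share stated in this changelog
const SEMANTIC_WEIGHT = 0.6; // cosine-similarity share stated in this changelog

// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank chunks by a weighted blend of keyword score and semantic score.
// Each chunk: { id, keywordScore (assumed 0..1), embeddingJson (JSON float array) }.
function rankHybrid(queryEmbedding, chunks) {
  return chunks
    .map((chunk) => {
      const embedding = JSON.parse(chunk.embeddingJson);
      // Negative similarities are clamped to 0 so they cannot lower the keyword part.
      const semantic = Math.max(0, cosineSimilarity(queryEmbedding, embedding));
      const score = KEYWORD_WEIGHT * chunk.keywordScore + SEMANTIC_WEIGHT * semantic;
      return { id: chunk.id, score };
    })
    .sort((x, y) => y.score - x.score); // highest blended score first
}

const ranked = rankHybrid([1, 0], [
  { id: 'keyword-only', keywordScore: 1.0, embeddingJson: '[0, 1]' },
  { id: 'semantic-only', keywordScore: 0.0, embeddingJson: '[1, 0]' },
]);
console.log(ranked.map((r) => r.id));
```

With these weights, a chunk that matches only semantically (0.6) outranks one that matches only by keyword (0.4), which is the behavior the impact note describes.
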
+
+## 2. File Conversion for Memory Indexer
+**Files:** `js/file-converters.js`, `js/context-memory.js`
+
+**What:**
+- Exposed `M.convertFileToMarkdown(file)` public API in `file-converters.js` — reuses existing converters (Mammoth.js, SheetJS, PDF.js, Turndown.js, native parsers)
+- Added `.numbers` (Apple Numbers) to the supported extensions map → uses SheetJS XLSX converter
+- Updated `processDir()` (folder attach) and `attachFiles()` (file picker) in `context-memory.js` to auto-detect binary formats and convert before indexing
+- Added `BINARY_EXTS` list: `docx`, `xlsx`, `xls`, `numbers`, `pdf`
+- Extended `TEXT_EXTS` list with: `ts`, `tsx`, `jsx`, `log`
+- Falls back to raw `file.text()` if conversion returns null
+
+**Impact:** Users can now attach folders containing DOCX, XLSX, XLS, Numbers, or PDF files; each file is converted to markdown before being chunked and indexed.
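
The detect, convert, and fall back flow described above might look roughly like the sketch below. Only the `BINARY_EXTS` contents, the existence of `M.convertFileToMarkdown(file)`, and the `file.text()` fallback come from this changelog. The helper names, the injected `convert` parameter, and the assumption that the converter resolves to a string or `null` are hypothetical.

```javascript
// Hypothetical sketch; not the actual code in js/context-memory.js.
const BINARY_EXTS = ['docx', 'xlsx', 'xls', 'numbers', 'pdf'];

// Lower-cased extension without the dot, or '' when the name has none.
function extensionOf(name) {
  const dot = name.lastIndexOf('.');
  return dot === -1 ? '' : name.slice(dot + 1).toLowerCase();
}

// `convert` stands in for M.convertFileToMarkdown(file); it is assumed to
// resolve to a markdown string, or null when conversion is not possible.
async function readForIndexing(file, convert) {
  if (BINARY_EXTS.includes(extensionOf(file.name))) {
    const markdown = await convert(file);
    if (markdown != null) return markdown;
  }
  // Text formats, and binary files whose conversion returned null, use raw text.
  return file.text();
}

// Usage with stand-in objects that mimic the parts of File used above.
const fakeDocx = { name: 'Spec.DOCX', text: async () => 'raw bytes as text' };
const fakeNote = { name: 'notes.md', text: async () => '# Notes' };
const fakeConvert = async (file) => `# Converted ${file.name}`;

readForIndexing(fakeDocx, fakeConvert).then((out) => console.log(out));
readForIndexing(fakeNote, fakeConvert).then((out) => console.log(out));
```

Passing the converter in as a parameter keeps the sketch runnable outside the browser; in the app the call would presumably go to `M.convertFileToMarkdown()` directly.
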
+
+## 3. Test Updates
+**Files:** `tests/feature/context-memory.spec.js`
+
+**What:**
+- Updated `modelSize` assertion from `'23MB'` to `'~150MB'` to match EmbeddingGemma model
+- Added new test cases for semantic search functionality
+
+---
+
+## Files Changed
+
+| File | Change |
+|------|--------|
+| `js/context-memory.js` | Hybrid search, embedding worker management, file conversion integration |
+| `js/file-converters.js` | `M.convertFileToMarkdown()` public API, `.numbers` extension |
+| `js/ai-docgen.js` | 💎 Semantic toggle button on Memory card, auto-embed on attach |
+| `public/embedding-worker.js` | **NEW** — Web Worker for EmbeddingGemma ONNX inference (WASM) |
+| `tests/feature/context-memory.spec.js` | Updated model size assertion, semantic search tests |