Commit 4180969

feat: hybrid semantic search & file conversion for Memory indexer
- EmbeddingGemma 300M ONNX model for 256-dim text embeddings (WASM backend)
- Hybrid search: FTS5 keyword (40%) + semantic cosine similarity (60%)
- 💎 Semantic toggle on Memory card auto-downloads model on file attach
- File conversion: DOCX/XLSX/XLS/Numbers/PDF auto-converted before indexing
- M.convertFileToMarkdown() public API in file-converters.js
- .numbers (Apple Numbers) support via SheetJS
- 📷 Camera capture button on OCR cards
1 parent a3902c3 commit 4180969

6 files changed

Lines changed: 663 additions & 43 deletions

README.md

Lines changed: 2 additions & 1 deletion
@@ -43,7 +43,7 @@
| **🎮 Game Builder** | `{{@Game:}}` tag — AI-generated games (Canvas 2D / Three.js / P5.js) or instant pre-built games via `@prebuilt:` field (chess, snake, shooter, pong, breakout, maths quiz, hiragana, kana master); engine selector pills; per-card model picker; CDN URL normalizer for CSP compliance; auto model-ready check before generation; 📋 Import button for pasting/uploading external HTML game code with source viewer; 📥 Export as standalone HTML; ⛶ fullscreen; single-line field parsing; "Games for Kids" template with 8 playable games |
| **🐧 Linux Terminal** | `{{Linux:}}` tag — two modes: (1) Terminal mode opens full Debian Linux ([WebVM](https://webvm.io)) in new window with `Packages:` field; (2) Compile & Run mode (`Language:` + `Script:`) compiles/executes 25+ languages (C++, Rust, Go, Java, Python, TypeScript, Kotlin, Scala…) via [Judge0 CE](https://ce.judge0.com) with inline output, execution time & memory stats |
| **❓ Help Mode** | Interactive learning mode — click ❓ Help to highlight all buttons, click any button for description + keyboard shortcut + animated demo video; 50% screen demo panel with fullscreen expand; 16 dedicated demo videos mapped to every toolbar button |
-| **🧠 Context Memory** | `{{@Memory:}}` tag for workspace intelligence — SQLite FTS5 full-text search with heading-aware chunking (~1500 chars/chunk); three storage modes: browser-only (IndexedDB), disk workspace (`.textagent/memory.db`), external folders (IndexedDB); `@use: workspace, my-docs` in AI/Think/Agent tags for multi-source context retrieval; Memory Selector dropdown on AI/Think/Agent cards; amber-accented Memory card with Folder/Files/Rebuild buttons + stats; auto-discovery of workspace files; `Use: none` opt-out; reuses existing sql.js WASM (zero bundle increase) |
+| **🧠 Context Memory** | `{{@Memory:}}` tag for workspace intelligence — **hybrid search**: SQLite FTS5 keyword search (40%) + EmbeddingGemma 300M semantic cosine similarity (60%) with heading-aware chunking (~1500 chars/chunk); 💎 Semantic toggle auto-downloads embedding model (~150MB WASM) on first file attach; three storage modes: browser-only (IndexedDB), disk workspace (`.textagent/memory.db`), external folders (IndexedDB); **file conversion**: binary formats (DOCX, XLSX, XLS, Numbers, PDF) auto-converted to markdown via Mammoth.js/SheetJS/PDF.js before indexing; `@use: workspace, my-docs` in AI/Think/Agent tags for multi-source context retrieval; Memory Selector dropdown on AI/Think/Agent cards; amber-accented Memory card with Folder/Files/Rebuild buttons + stats; auto-discovery of workspace files; `Use: none` opt-out; reuses existing sql.js WASM (zero bundle increase) |
| **✉️ Email to Self** | Send documents directly to your inbox from the share modal — email address input with `.md` file attached + share link; powered by Google Apps Script (free, 100 emails/day); Cloudflare Turnstile CAPTCHA verification; dual rate limiting (100/day global + 7/day per recipient); loading state + success/error feedback; email persisted in localStorage; zero third-party dependencies |
| **💾 Disk Workspace** | Folder-backed storage via File System Access API — "Open Folder" in sidebar header; `.md` files read/written directly to disk; `.textagent/workspace.json` manifest; debounced autosave ("💾 Saved to disk" indicator); refresh from disk for external edits; disconnect to revert to localStorage; auto-reconnect on reload via IndexedDB handles; unified action modal for rename/duplicate/delete with confirmation; Chromium-only (hidden in unsupported browsers) |
| **📈 Finance Dashboard** | Stock/crypto/index dashboard templates with live TradingView charts; dynamic grid via `data-var-prefix` (add/remove tickers in `@variables` table, grid auto-adjusts); configurable chart range (`1M`, `12M`, `36M`), interval (`D`, `W`, `M`), and EMA period (default 52); interactive 1M/1Y/3Y range + 52D/52W/52M EMA toggle buttons; `@variables` table persists after ⚡ Vars for re-editing; JS code block generates grid HTML from variables |
@@ -539,6 +539,7 @@ TextAgent has undergone significant evolution since its inception. What started
| Date | Commits | Feature / Update |
|------|---------|-----------------:|
| **2026-03-23** | | 📷 **OCR Camera Capture** — new 📷 camera button on `{{@OCR:}}` cards for live camera capture; `getUserMedia` with rear-camera preference (`facingMode: 'environment'`); modal overlay with live video feed, 📸 Capture → preview → 🔄 Retake / ✅ Use Photo flow; captured images stored as JPEG (0.85 quality, max 1280px) in `blockUploads` map; native `<input capture="environment">` fallback for browsers without `getUserMedia`; amber-themed modal CSS with dark/light support; Escape/overlay-click/✕ dismissal |
+| **2026-03-23** | | 💎 **Hybrid Semantic Search & File Conversion** — Memory indexer now supports hybrid search: FTS5 keyword matching (40%) + EmbeddingGemma 300M cosine similarity (60%); new `public/embedding-worker.js` Web Worker with WASM backend (WebGPU shader workaround); `memory_embeddings` table stores 256-dim float arrays per chunk; 💎 Semantic toggle on Memory card auto-downloads model (~150MB) when files are attached; binary file conversion integrated — DOCX (Mammoth.js), XLSX/XLS/Numbers (SheetJS), PDF (PDF.js+OCR), CSV, HTML, JSON, XML auto-converted to markdown before chunking; `M.convertFileToMarkdown()` public API exposed in `file-converters.js`; `.numbers` Apple Numbers format added to import pipeline |
| **2026-03-23** | | 🐛 **GLM-OCR Model Download Fix** — fixed GLM-OCR model failing to download; `glm_ocr` model type was not supported in Transformers.js `4.0.0-next.7`; upgraded to `4.0.0-next.8` which includes `glm_ocr` model class mapping; both `ai-worker-glm-ocr.js` and `public/ai-worker-glm-ocr.js` updated |
| **2026-03-23** | | 🌐 **API Explorer Template** — comprehensive API Explorer template listing ALL 1400+ public APIs from [public-apis/public-apis](https://github.com/public-apis/public-apis) across 51 categories; each category includes working `{{API:}}` blocks for no-auth APIs (click-to-try GET requests) plus reference tables for auth-required APIs with Auth type, HTTPS, and CORS info; auto-generated from GitHub raw README via Node.js parser; new `api-explorer` template category with `bi-globe2` icon; template count 136→137+, categories 14→15 |
| **2026-03-23** | | 🎨 **Read-Only UI Cleanup** — composer FAB and floating panel hidden in read-only mode (`body.editor-readonly`); agent panel and toggles hidden when both read-only AND header-hidden (`body.editor-readonly.header-hidden`); removed redundant Import button from header toolbar, mobile menu, and QAB (Upload/drag-and-drop dropzone covers same 8-format import functionality); updated Help Mode entry for Upload button |
Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
# Hybrid Semantic Search & File Conversion for Memory Indexer

## Summary

Adds hybrid semantic search to Context Memory (FTS5 keywords + EmbeddingGemma cosine similarity) and integrates the existing file-to-markdown conversion pipeline into the Memory indexer, so binary file formats (DOCX, XLSX, XLS, Numbers, PDF) can be indexed.

---

## 1. Hybrid Semantic Search
**Files:** `js/context-memory.js`, `public/embedding-worker.js` (new), `js/ai-docgen.js`

**What:**
- EmbeddingGemma 300M ONNX model (`textagent/embeddinggemma-300m-ONNX`) for generating 256-dim text embeddings
- New embedding Web Worker (`public/embedding-worker.js`) with WASM backend (forced; WebGPU has shader issues with SimplifiedLayerNormalization)
- Hybrid search combines FTS5 keyword search (40%) with semantic cosine similarity (60%)
- `memory_embeddings` table stores per-chunk embeddings as JSON float arrays
- 💎 Semantic toggle on Memory card: auto-downloads embedding model (~150MB) when files are attached, serves as status indicator and manual re-embed trigger
- `enableSemanticSearch()`, `disableSemanticSearch()`, `getEmbeddingStatus()`, `reembedSource()` APIs

**Impact:** Queries like "how does authentication work?" now find relevant chunks even when the exact keyword "authentication" is absent, because chunks are also ranked by embedding similarity rather than by keyword overlap alone.
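
As a rough illustration of the 40/60 weighting listed above, the sketch below blends a keyword score with cosine similarity over embeddings stored as JSON float arrays. The names `hybridRank`, `cosineSimilarity`, `keywordScore`, and the chunk shape are hypothetical and are not taken from `js/context-memory.js`. It also assumes the FTS5 score has already been normalized to the 0..1 range, which the changelog does not state.

```javascript
// Weights taken from the changelog: 40% keyword, 60% semantic.
const KEYWORD_WEIGHT = 0.4;
const SEMANTIC_WEIGHT = 0.6;

// Cosine similarity between two equal-length numeric vectors.
// Returns 0 when either vector has zero length (norm of 0).
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical ranking helper. Each chunk is assumed to carry:
//   id           - chunk identifier
//   keywordScore - keyword relevance, assumed pre-normalized to [0, 1]
//   embedding    - JSON string holding a float array
function hybridRank(queryEmbedding, chunks) {
  return chunks
    .map((chunk) => {
      const vector = JSON.parse(chunk.embedding);
      const semantic = cosineSimilarity(queryEmbedding, vector);
      const score =
        KEYWORD_WEIGHT * chunk.keywordScore + SEMANTIC_WEIGHT * semantic;
      return { id: chunk.id, score };
    })
    .sort((x, y) => y.score - x.score); // highest blended score first
}
```

With these weights, a chunk that matches only semantically (keyword score 0, cosine 1) scores 0.6 and outranks a chunk that matches only by keyword (keyword score 1, cosine 0), which scores 0.4.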

## 2. File Conversion for Memory Indexer
**Files:** `js/file-converters.js`, `js/context-memory.js`

**What:**
- Exposed `M.convertFileToMarkdown(file)` public API in `file-converters.js`; it reuses the existing converters (Mammoth.js, SheetJS, PDF.js, Turndown.js, native parsers)
- Added `.numbers` (Apple Numbers) to the supported extensions map → uses SheetJS XLSX converter
- Updated `processDir()` (folder attach) and `attachFiles()` (file picker) in `context-memory.js` to auto-detect binary formats and convert before indexing
- Added `BINARY_EXTS` list: `docx`, `xlsx`, `xls`, `numbers`, `pdf`
- Extended `TEXT_EXTS` list with: `ts`, `tsx`, `jsx`, `log`
- Falls back to raw `file.text()` if conversion returns null

**Impact:** Users can now attach folders containing DOCX, XLSX, XLS, Numbers, or PDF files; each file is converted to markdown before being chunked and indexed.
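
The detect, convert, and fall-back flow described above might look roughly like the following sketch. Only the `BINARY_EXTS` values and the "fall back to `file.text()` when conversion returns null" rule come from the changelog. The helpers `getExtension` and `readForIndexing` are hypothetical, and the converter is passed in as a parameter (standing in for `M.convertFileToMarkdown`) on the assumption that it may be asynchronous.

```javascript
// Extension list as given in the changelog.
const BINARY_EXTS = ['docx', 'xlsx', 'xls', 'numbers', 'pdf'];

// Hypothetical helper: lower-cased extension without the dot, or '' if none.
function getExtension(filename) {
  const dot = filename.lastIndexOf('.');
  return dot === -1 ? '' : filename.slice(dot + 1).toLowerCase();
}

// Hypothetical helper: returns the text that would be chunked and indexed.
// `convert` stands in for M.convertFileToMarkdown and is assumed to resolve
// to a markdown string, or to null when conversion is not possible.
async function readForIndexing(file, convert) {
  if (BINARY_EXTS.includes(getExtension(file.name))) {
    const markdown = await convert(file);
    if (markdown !== null) return markdown;
  }
  // Text formats, and binary files whose conversion returned null,
  // are read as raw text.
  return file.text();
}
```

Passing the converter as an argument keeps the sketch independent of how `file-converters.js` is loaded; in the application the call would presumably go through `M.convertFileToMarkdown(file)` directly.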

## 3. Test Updates
**Files:** `tests/feature/context-memory.spec.js`

**What:**
- Updated `modelSize` assertion from `'23MB'` to `'~150MB'` to match EmbeddingGemma model
- Added new test cases for semantic search functionality

---

## Files Changed

| File | Change |
|------|--------|
| `js/context-memory.js` | Hybrid search, embedding worker management, file conversion integration |
| `js/file-converters.js` | `M.convertFileToMarkdown()` public API, `.numbers` extension |
| `js/ai-docgen.js` | 💎 Semantic toggle button on Memory card, auto-embed on attach |
| `public/embedding-worker.js` | **NEW**: Web Worker for EmbeddingGemma ONNX inference (WASM) |
| `tests/feature/context-memory.spec.js` | Updated model size assertion, semantic search tests |
