Feature/model download#142

Open
mikedoise wants to merge 10 commits into mattt:main from Techopolis:feature/model-download

Conversation

@mikedoise

Hello,

This PR adds download management to AnyLanguageModel. The goal is to allow the built-in MLX-Swift download mechanics to work through AnyLanguageModel. This has been tested in my app. Please let me know if any changes are needed, or if you have any questions.

mikedoise and others added 10 commits February 23, 2026 09:02
Persist KV caches across respond()/streamResponse() calls within the
same LanguageModelSession. On subsequent turns only the new tokens are
prefilled instead of re-encoding the entire conversation history,
dramatically reducing time to first token.

- Add maxKVSize, kvBits, kvGroupSize to GenerationOptions
- Add SessionCacheEntry store with NSMapTable weak keys
- Implement incremental prefill in streamResponse() and respond()
- Enhance prewarm() to prefill system prompt into KV cache

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
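The incremental-prefill idea above can be sketched as plain bookkeeping: only the suffix of the prompt that the cache has not seen is prefilled. This is a minimal stand-alone sketch, not the PR's implementation; `SessionCacheEntry` here is a stand-in holding only a token list.

```swift
// Hypothetical sketch of incremental KV prefill bookkeeping.
struct SessionCacheEntry {
    var prefilledTokens: [Int] = []
}

// Returns only the suffix of `tokens` that still needs prefilling,
// given what the cache already holds from previous turns.
func tokensToPrefill(_ tokens: [Int], cache: inout SessionCacheEntry) -> [Int] {
    // The cache is reusable only if it is a strict prefix of the new prompt.
    if cache.prefilledTokens.count <= tokens.count,
       Array(tokens.prefix(cache.prefilledTokens.count)) == cache.prefilledTokens {
        let newTokens = Array(tokens.suffix(from: cache.prefilledTokens.count))
        cache.prefilledTokens = tokens
        return newTokens
    }
    // Prompt diverged (e.g. transcript replaced): re-prefill everything.
    cache.prefilledTokens = tokens
    return tokens
}
```

On a second turn, only the newly appended tokens are returned, which is what makes time to first token drop on long conversations.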
- Add GPUMemoryConfiguration struct with .automatic (RAM-scaled) and
  .unconstrained presets for controlling Metal buffer pool limits
- Add GPUMemoryManager singleton with reference-counted active/idle
  toggling — cache stays high during concurrent generations, drops to
  idle limit only when all sessions complete
- Wrap respond(), streamResponse(), and prewarm() with markActive/markIdle
- Call evict() on removeFromCache/removeAllFromCache to reclaim GPU buffers
- Upgrade mlx-swift from 0.29.1 to 0.30.6 (fast SDPA, cache race fix,
  Memory API, wired memory, iPhone 16 Pro NAX fix)
- Upgrade mlx-swift-lm from 2.29.3 to 2.30.6 (Gemma3n per-layer
  intermediate_size, model loading perf, chat rehydration, tool calling)
- Migrate deprecated GPU.set(cacheLimit:)/GPU.clearCache() to Memory.*

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
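The reference-counted active/idle toggling described above can be illustrated without MLX: the limit stays raised while any generation is in flight and drops only when the count reaches zero. This is a simplified sketch, assuming the real GPUMemoryManager applies analogous limits through MLX's Memory API; here the limit is just an integer so the counting logic is visible.

```swift
// Hypothetical sketch of reference-counted active/idle memory limits.
final class MemoryLimitToggler {
    private var activeCount = 0
    private(set) var currentLimit: Int
    let activeLimit: Int
    let idleLimit: Int

    init(activeLimit: Int, idleLimit: Int) {
        self.activeLimit = activeLimit
        self.idleLimit = idleLimit
        self.currentLimit = idleLimit
    }

    func markActive() {
        activeCount += 1
        currentLimit = activeLimit  // raise the cache limit while any generation runs
    }

    func markIdle() {
        activeCount = max(0, activeCount - 1)
        if activeCount == 0 {
            currentLimit = idleLimit  // drop only when all sessions complete
        }
    }
}
```

Reference counting is what keeps the cache high during concurrent generations: the first `markIdle()` from one session does not penalize another session that is still generating.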
# Conflicts:
#	Sources/AnyLanguageModel/Models/MLXLanguageModel.swift
Gemma 3's Jinja chat template has no tool role support, causing
tool result messages to crash the template engine during chat
history replay. This fixes the issue by folding tool outputs into
the preceding assistant message instead of using a separate .tool()
role.

Changes:
- Fold tool results into assistant messages with [Tool result]: prefix
  to maintain strict user/assistant alternation required by Gemma 3
- Add max tool iteration guard (5) to prevent infinite tool-call loops
- Fix convertToSendableJSONValue to return NSNull() instead of
  JSONValue.null so Jinja's Value(any:) can handle it
- Check Bool before NSNumber to prevent booleans becoming integers
- Record assistant text before tool calls in transcript for accurate
  chat replay and KV cache consistency
- Move final text accumulation to after tool loop exit so only the
  final response is returned

Fixes mattt#112

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
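The folding strategy above can be sketched as a small transcript transform: instead of emitting a separate tool-role message (which Gemma 3's template rejects), the tool output is appended to the preceding assistant turn. `Message` and `Role` are stand-in types for illustration, not the PR's actual transcript model.

```swift
// Hypothetical sketch: fold tool outputs into the preceding assistant
// message so the transcript keeps strict user/assistant alternation.
enum Role { case user, assistant }
struct Message {
    var role: Role
    var content: String
}

// Appends a tool result by extending the last assistant message instead
// of adding a separate tool-role message.
func appendToolResult(_ result: String, to transcript: inout [Message]) {
    if let last = transcript.indices.last, transcript[last].role == .assistant {
        transcript[last].content += "\n[Tool result]: \(result)"
    } else {
        // No assistant turn to fold into; start one carrying the result.
        transcript.append(Message(role: .assistant, content: "[Tool result]: \(result)"))
    }
}
```

Because the result lives inside the assistant message, replaying the history through the Jinja template never encounters an unsupported role.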
…ool-aware prewarm

- Add prefillTokenHash to SessionCacheEntry to detect stale cache from
  replaced conversations (not just token count)
- Extract resolveCache() helper to deduplicate cache hit/miss logic
  between respond() and streamResponse()
- GPUMemoryManager.configure() now uses first-write-wins to prevent
  multiple MLXLanguageModel instances from silently overwriting config
- prewarm() accepts tools via protocol and session automatically
  forwards registered tools so prefill tokenization matches respond()

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Detect when the MLX tool loop generates the same tool call signature
  as the previous iteration and break early instead of retrying
- Clear sessionKVCache in removeAllFromCache() so memory warning
  handlers actually free GPU memory from cached KV states

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
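The duplicate-detection guard above reduces to comparing a signature of the current tool call against the previous iteration's. A minimal sketch, assuming a signature built from the tool name and its serialized arguments (the real implementation may compare differently):

```swift
// Hypothetical sketch: break out of the tool loop when the model emits
// the same tool call (name + arguments) twice in a row.
struct ToolCall {
    var name: String
    var argumentsJSON: String
}

func signature(of call: ToolCall) -> String {
    "\(call.name)|\(call.argumentsJSON)"
}

func shouldBreakLoop(previous: ToolCall?, current: ToolCall) -> Bool {
    guard let previous else { return false }
    return signature(of: previous) == signature(of: current)
}
```

Breaking early on an identical repeat complements the fixed iteration cap: the loop exits as soon as the model stops making progress rather than burning all remaining iterations.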
…e tool call detection

# Conflicts:
#	Sources/AnyLanguageModel/Models/MLXLanguageModel.swift
Adds snapshotTranscript() and replaceTranscript() to LanguageModelSession,
enabling consumers to swap a session's context without creating a second
session. This is critical for MLX where each session allocates a KV cache
in app memory — swapping transcripts keeps peak memory at one cache.

- LanguageModel protocol: add invalidateCache(for:) with default no-op
- LanguageModelSession: add snapshotTranscript() and replaceTranscript()
  with atomic check-and-mutate to prevent TOCTOU races
- MLXLanguageModel: override invalidateCache to evict KV cache entry

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
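The atomic check-and-mutate mentioned above can be illustrated with a versioned box: a snapshot carries a version token, and a replace succeeds only if no writes happened in between. This is a simplified sketch of the TOCTOU protection, not the session's actual implementation; the transcript here is just an array of strings.

```swift
import Foundation

// Hypothetical sketch of the check-and-mutate behind replaceTranscript():
// the snapshot's version must still match when the swap happens.
final class TranscriptBox {
    private let lock = NSLock()
    private var transcript: [String] = []
    private var version = 0

    // snapshotTranscript() analogue: capture contents plus a version token.
    func snapshot() -> (entries: [String], version: Int) {
        lock.lock(); defer { lock.unlock() }
        return (transcript, version)
    }

    // replaceTranscript() analogue: succeed only if no writes happened
    // since the snapshot, preventing a TOCTOU race between check and mutate.
    func replace(with entries: [String], ifVersion expected: Int) -> Bool {
        lock.lock(); defer { lock.unlock() }
        guard version == expected else { return false }
        transcript = entries
        version += 1
        return true
    }
}
```

Swapping transcripts in one session, rather than creating a second session, is what keeps peak memory at a single KV cache on MLX.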
Expose proactive download management for MLX models:
- DownloadProgress struct, ModelDownloadState enum, DownloadableLanguageModel protocol
- MLXModelDownloadManager (@observable) with disk-state scanning, AsyncStream<DownloadProgress> downloads, and delete support
- MLXLanguageModel conforms to DownloadableLanguageModel
- Wire progress reporting into loadContext() so lazy downloads also update state
- Re-export Hub types for consumer access

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
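The download-state surface named above can be sketched with stand-in types. Field names here are assumptions for illustration, not the PR's exact API; the real `MLXModelDownloadManager` additionally scans disk state and streams progress via `AsyncStream<DownloadProgress>`.

```swift
// Hypothetical sketch of the download-state types named in this commit.
struct DownloadProgress {
    var bytesDownloaded: Int64
    var totalBytes: Int64

    // Completed fraction in [0, 1]; 0 while the total is unknown.
    var fraction: Double {
        totalBytes > 0 ? Double(bytesDownloaded) / Double(totalBytes) : 0
    }
}

enum ModelDownloadState {
    case notDownloaded
    case downloading(DownloadProgress)
    case downloaded

    var isReady: Bool {
        if case .downloaded = self { return true }
        return false
    }
}
```

A consumer would observe the state to drive UI: show a progress bar while `.downloading`, enable generation once `.downloaded`.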