---
## Dynamic LoRA Routing & Control Vectors (Multi-Tenant Serving)

Historically, `llama-cpp-python` only supported "static loading", where a LoRA was permanently baked into the context during initialization. Switching personas required reloading the entire model or duplicating it in VRAM.

`llama-cpp-python` now supports **Just-In-Time (JIT)** dynamic adapter routing. Instead of statically binding a single LoRA to a model during initialization (which locks the instance to a single task), you can preload multiple adapters into VRAM and apply them on the fly, per request.

This architecture unlocks true **Multi-Tenant Serving**:

* **Zero-Latency Switching:** Compute-graph weights are modified atomically in C++ memory immediately before evaluation; no model reload is required.
* **VRAM Efficiency:** The heavy base model is loaded only once, and all LoRAs share its memory.
* **Thread-Safe & Contamination-Free:** Strict internal state debouncing resets the weights between requests, so no persona state leaks from one request into the next.
```python
# 4. User B: Translation Task (zero-latency switch to the translator LoRA)
response_b = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain quantum physics in Spanish."}],
    active_loras=[{"name": "translator", "scale": 0.85}]  # Apply at 85% strength
)

# 5. User C: General Query (automatically wipes graph weights for a clean base-model state)
response_c = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is the capital of France?"}]
)

# 6. Cleanup (optional: manually free VRAM for specific LoRAs)
llm.unload_lora("python_coder")
```
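In a multi-tenant server, the per-request `active_loras` argument can be driven by a simple routing table keyed on the tenant or task. A minimal sketch, assuming illustrative tenant names and adapter mappings (the `LORA_ROUTES` table and `route_loras` helper are hypothetical, not part of the library):

```python
# Hypothetical routing table: tenant id -> LoRA configuration for that persona.
LORA_ROUTES = {
    "coder":      [{"name": "python_coder", "scale": 1.0}],
    "translator": [{"name": "translator", "scale": 0.85}],
    "default":    [],  # empty list -> clean base model, no adapter applied
}

def route_loras(tenant_id):
    """Pick the adapter set for a request; unknown tenants fall back to the base model."""
    return LORA_ROUTES.get(tenant_id, LORA_ROUTES["default"])
```

Each request would then pass `active_loras=route_loras(tenant_id)` to `create_chat_completion`, keeping the persona selection in one place.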

### Control Vector Injection (Representation Engineering)

In addition to LoRA, the API supports dynamic injection of **Control Vectors (CVec)**. This lets you steer the model's behavior, emotion, or alignment by directly modifying activation values at specific hidden layers, without needing `.gguf` weight files.

```python
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Tell me a story about a futuristic city."}],
    control_vector={
        "data": [...],       # A flattened 1D list of floats representing the vector
        "layer_start": 15,   # Apply starting from this layer (inclusive)
        "layer_end": 32      # Apply up to this layer (inclusive)
    }
)
```

*Note (JamePeng): Ensure your `data` array length exactly matches `embedding_length * layer_end`. The C++ backend maps the buffer contiguously starting from layer 1, so earlier skipped layers must be zero-padded in your array.*
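Following that note, a small helper can build a correctly sized, zero-padded buffer from a single per-layer direction vector. This is a hedged sketch (the `build_control_vector_data` helper is illustrative, not a library function), assuming the same vector is applied to every layer in `[layer_start, layer_end]`:

```python
def build_control_vector_data(direction, embedding_length, layer_start, layer_end):
    """Build a flat buffer of length embedding_length * layer_end.

    Layers 1..layer_start-1 are zero-padded (the backend maps the buffer
    contiguously from layer 1); layers layer_start..layer_end receive the
    per-layer direction vector.
    """
    assert len(direction) == embedding_length
    data = []
    for layer in range(1, layer_end + 1):
        if layer < layer_start:
            data.extend([0.0] * embedding_length)  # skipped early layers: zeros
        else:
            data.extend(direction)  # steered layers: the direction vector
    return data
```

The resulting list can be passed directly as the `"data"` field of `control_vector`, keeping the length invariant from the note satisfied by construction.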
The `Llama` class provides extensive control over the `llama.cpp` sampling chain during text generation. You can configure state-of-the-art sampling algorithms, dynamic temperature, and advanced repetition penalties directly via the `generate`, `create_completion`, or `__call__` methods.