Commit ee15771

Update README.md for Dynamic LoRA Routing & Control Vectors
1 parent 57bfdd8 commit ee15771

File tree: 1 file changed (+62, -0 lines)

README.md

Lines changed: 62 additions & 0 deletions
@@ -521,6 +521,68 @@ llm = Llama.from_pretrained(
---
## Dynamic LoRA Routing & Control Vectors (Multi-Tenant Serving)
Historically, `llama-cpp-python` supported only "static loading", where a LoRA was permanently baked into the context at initialization. Switching personas required reloading the entire model or duplicating it in VRAM.
`llama-cpp-python` now supports **Just-In-Time (JIT)** dynamic adapter routing. Instead of statically binding a single LoRA to a model at initialization (which locks the instance to a single task), you can preload multiple adapters into VRAM and apply them on the fly, per request.
This architecture unlocks true **Multi-Tenant Serving**:
* **Zero-Latency Switching:** Compute-graph weights are modified atomically in C++ memory immediately before evaluation, so no model reload is needed.
* **VRAM Efficiency:** The heavy base model is loaded only once; multiple LoRAs share the same base model memory.
* **Thread-Safe & Contamination-Free:** Strict internal state debouncing ensures that weights are fully reset between requests, guaranteeing zero persona contamination.
### Dynamic LoRA Example
```python
from llama_cpp import Llama
# 1. Load the pure base model once
llm = Llama(model_path="path/to/llama-3-8b.gguf")
# 2. Preload multiple LoRAs into VRAM
llm.load_lora("python_coder", "path/to/python-coder-lora.gguf")
llm.load_lora("translator", "path/to/spanish-translator-lora.gguf")
# 3. User A: Coding Task (Instantly applies the coder LoRA)
response_a = llm.create_chat_completion(
messages=[{"role": "user", "content": "Write a fast inverse square root in C."}],
active_loras=[{"name": "python_coder", "scale": 1.0}]
)
# 4. User B: Translation Task (Zero-latency switch to the translator LoRA)
response_b = llm.create_chat_completion(
messages=[{"role": "user", "content": "Explain quantum physics in Spanish."}],
active_loras=[{"name": "translator", "scale": 0.85}] # Apply at 85% strength
)
# 5. User C: General Query (Automatically wipes graph weights for a clean base model state)
response_c = llm.create_chat_completion(
messages=[{"role": "user", "content": "What is the capital of France?"}]
)
# 6. Cleanup (Optional: manually free VRAM for specific LoRAs)
llm.unload_lora("python_coder")
```
### Control Vector Injection (Representation Engineering)
In addition to LoRA, the API supports dynamic injection of **Control Vectors (CVec)**. This lets you steer the model's behavior, emotion, or alignment by directly modifying the activation values at specific hidden layers, without needing `.gguf` weight files.
```python
response = llm.create_chat_completion(
messages=[{"role": "user", "content": "Tell me a story about a futuristic city."}],
control_vector={
"data": [...], # A flattened 1D list of floats representing the vector
"layer_start": 15, # Apply starting from this layer (inclusive)
"layer_end": 32 # Apply up to this layer (inclusive)
}
)
```
*Note (JamePeng): Ensure your `data` array length exactly matches `embedding_length * layer_end`. The C++ backend maps the buffer contiguously starting from layer 1, so skipped early layers must be zero-padded in your array.*
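As a hedged sketch of that padding rule (assuming `embedding_length = 4096`, as for a Llama-3-8B model, and the layer range 15 to 32 from the example above; the per-layer steering values are placeholders, not meaningful directions):

```python
# Hypothetical sketch: building a zero-padded control vector buffer.
# The backend maps the buffer contiguously from layer 1, so layers
# below layer_start must be filled with zeros.
embedding_length = 4096          # assumed hidden size (model-dependent)
layer_start, layer_end = 15, 32  # inclusive range from the example above

data = []
for layer in range(1, layer_end + 1):
    if layer < layer_start:
        # Skipped early layers: zero padding keeps the offsets aligned.
        data.extend([0.0] * embedding_length)
    else:
        # Placeholder steering values for the active layers.
        data.extend([0.01] * embedding_length)

# Length must exactly match embedding_length * layer_end.
assert len(data) == embedding_length * layer_end
```

The resulting flat list is what would go into the `control_vector["data"]` field shown above.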
---
## Sampling Configuration & Usage (LlamaSamplingParams)
The `Llama` class provides extensive control over the `llama.cpp` sampling chain during text generation. You can configure state-of-the-art sampling algorithms, dynamic temperature, and advanced repetition penalties directly via the `generate`, `create_completion`, or `__call__` methods.
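As a minimal sketch of how such settings are passed (the keyword names below are the standard `llama-cpp-python` completion parameters; the values are illustrative, not recommendations):

```python
# Illustrative sampling settings, passed as keyword arguments to
# create_completion / create_chat_completion / __call__.
sampling_kwargs = {
    "max_tokens": 64,
    "temperature": 0.7,     # < 1.0 sharpens the token distribution
    "top_p": 0.9,           # nucleus sampling: keep the top 90% probability mass
    "top_k": 40,            # keep only the 40 most likely tokens
    "repeat_penalty": 1.1,  # penalize recently generated tokens
}
# Usage (assuming `llm` is an initialized Llama instance):
#     llm.create_completion("Write a haiku about the sea.", **sampling_kwargs)
```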
