Refactor turbomind engine #4223
base: main
Conversation
Pull request overview
This PR performs a major refactoring of the turbomind engine architecture with the following key changes:
- Replaces `LlamaTritonModel` with a new `TurboMind` class providing a cleaner API
- Removes the old batch processing implementation (`LlamaBatch`, `LlamaV2`)
- Introduces new model abstractions: `LanguageModel`, `InputProcessor`, and `OutputProcessor` to better separate concerns
- Updates `RequestMetrics` fields to use atomic operations for thread-safe access
- Consolidates model-related code into a unified `models` CMake target
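
For orientation, the sketch below shows one way these components could fit together. Only the class names `TurboMind`, `LanguageModel`, `InputProcessor`, and `OutputProcessor` appear in the PR summary; the `Request`/`Output` types, method names, and wiring are illustrative assumptions, not the actual API introduced by this PR.

```cpp
// Hypothetical sketch only: every member, method, and the Request/Output
// placeholder types below are assumptions made for illustration.
#include <memory>
#include <utility>

struct Request {};  // an inference request (assumed placeholder)
struct Output {};   // a decoded result (assumed placeholder)

class InputProcessor {   // prepares model inputs from a raw request
public:
    virtual ~InputProcessor() = default;
    virtual void Process(const Request& req) = 0;
};

class LanguageModel {    // runs the transformer forward pass
public:
    virtual ~LanguageModel() = default;
    virtual void Forward() = 0;
};

class OutputProcessor {  // converts model outputs into user-visible results
public:
    virtual ~OutputProcessor() = default;
    virtual Output Collect() = 0;
};

// TurboMind (replacing LlamaTritonModel) owns the three components, keeping
// input handling, the forward pass, and output handling separated.
class TurboMind {
public:
    TurboMind(std::unique_ptr<InputProcessor>  in,
              std::unique_ptr<LanguageModel>   lm,
              std::unique_ptr<OutputProcessor> out)
        : input_(std::move(in)), model_(std::move(lm)), output_(std::move(out)) {}

    Output Infer(const Request& req)
    {
        input_->Process(req);       // 1. build model-ready inputs
        model_->Forward();          // 2. run the language model
        return output_->Collect();  // 3. package the result
    }

private:
    std::unique_ptr<InputProcessor>  input_;
    std::unique_ptr<LanguageModel>   model_;
    std::unique_ptr<OutputProcessor> output_;
};
```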
Reviewed changes
Copilot reviewed 102 out of 102 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/turbomind/utils/metrics.h | Changed metric fields to atomic types and fixed typo in field name |
| src/turbomind/turbomind.h/cc | New TurboMind class interface replacing LlamaTritonModel |
| src/turbomind/triton_backend/llama/* | Removed old triton backend files |
| src/turbomind/python/bind.cpp | Updated Python bindings to use new TurboMind class |
| src/turbomind/models/language_model.* | New LanguageModel abstraction for inference |
| src/turbomind/models/input_processor.* | New component for handling input processing |
| src/turbomind/models/output_processor.* | New component for handling output processing |
| src/turbomind/models/llama/unified_decoder.* | Updated to work with new architecture |
| src/turbomind/models/llama/unified_attention_layer.* | Refactored attention layer implementation |
| src/turbomind/models/llama/llama_utils.cu | Changed isTuning() from thread_local to static |
| src/turbomind/layers/sampling_layers/* | Removed old sampling layer files |
| src/turbomind/kernels/sampling_kernels.h | Changed sampled_indexes/nums types from uint32_t to int |
struct RequestMetrics {
-    int64_t enque_time{};      // when a request is enqued
-    int64_t scheduled_time{};  // when a request is scheduled for inference
+    std::atomic<int64_t> enqueue_time{};  // when a request is enqued
Copilot AI commented on Dec 26, 2025:
Spelling error in comment: "enqued" should be "enqueued".
os << "RequestMetrics { ";
-os << "enque_time=" << m.enque_time;
-os << ", scheduled_time=" << m.scheduled_time;
+os << "enque_time=" << m.enqueue_time.load(std::memory_order_relaxed);
Copilot AI commented on Dec 26, 2025:
Inconsistent naming: the output stream operator still uses the old field name "enque_time" in the message, but it should be "enqueue_time" to match the renamed field.
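
One hedged way to address this, assuming `scheduled_time` was also converted to an atomic (the excerpt above only shows `enqueue_time`), is to label each field with its new name and read both atomics with relaxed loads:

```cpp
// Sketch of a consistent stream operator; the exact field set of
// RequestMetrics is assumed from the diff excerpt and PR summary.
#include <atomic>
#include <cstdint>
#include <ostream>

struct RequestMetrics {
    std::atomic<int64_t> enqueue_time{};    // when a request is enqueued
    std::atomic<int64_t> scheduled_time{};  // when a request is scheduled for inference
};

inline std::ostream& operator<<(std::ostream& os, const RequestMetrics& m)
{
    os << "RequestMetrics { ";
    os << "enqueue_time=" << m.enqueue_time.load(std::memory_order_relaxed);
    os << ", scheduled_time=" << m.scheduled_time.load(std::memory_order_relaxed);
    os << " }";
    return os;
}
```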
bool& isTuning()
{
-    thread_local bool value{};
+    static bool value{};
Copilot AI commented on Dec 26, 2025:
The isTuning() function changed from thread_local to static, which removes thread safety. If this function is accessed from multiple threads simultaneously in the new architecture, this could cause race conditions. Consider whether thread-local storage is still needed or if synchronization is required.
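
If the flag really is read and written from multiple threads after the refactor, one option (a sketch only, not what this PR implements) is to back it with a `std::atomic<bool>`. This changes the accessor from a `bool&` to getter/setter functions, so existing call sites would need updating:

```cpp
// Hypothetical alternative to the plain static flag; the names
// TuningFlag/setTuning are assumptions introduced for this sketch.
#include <atomic>

namespace turbomind {

inline std::atomic<bool>& TuningFlag()
{
    static std::atomic<bool> value{false};  // initialized once; C++11 guarantees thread-safe init
    return value;
}

inline bool isTuning()
{
    return TuningFlag().load(std::memory_order_relaxed);
}

inline void setTuning(bool on)
{
    TuningFlag().store(on, std::memory_order_relaxed);
}

}  // namespace turbomind
```

If each worker thread is instead meant to keep its own independent tuning flag, as the original `thread_local` suggests, then keeping `thread_local` is the simpler fix.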