feat: integrate Llama.cpp and enhance engine stability for cross-platform usage#616
krishjp wants to merge 11 commits into PrunaAI:main
Conversation
…device checks for llama-cpp models due to a lack of `model.parameters()` support

…on 3.13
- addressed `functools.partial` object compatibility with Python 3.13
- integrated `enum.member()` in `SAVE_FUNCTIONS` and `LOAD_FUNCTIONS`
- updated the LlamaCpp algorithm implementation to use the standardized naming convention
- cleaned up redundant commented-out logic in the `save_pruna_model` function

Verified through restoration of LlamaCpp integration tests and diagnostic scripts confirming Enum member registration.
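The `enum.member()` change above addresses a real behavior shift: in Python 3.13, `functools.partial` became a descriptor, so a bare partial assigned in an `Enum` body is no longer treated as a member. A minimal sketch of the pattern (the member name and the stand-in callable here are illustrative, not the PR's actual entries):

```python
import enum
import functools
import sys

def save_gguf(model: str, path: str) -> str:
    """Stand-in save callable (hypothetical, for illustration)."""
    return f"saved {model} to {path}"

if sys.version_info >= (3, 11):
    # On Python 3.13, functools.partial implements the descriptor
    # protocol, so a bare partial in an Enum body would be silently
    # skipped as a member; enum.member() (added in 3.11) forces it
    # to be registered as a member.
    class SAVE_FUNCTIONS(enum.Enum):
        llama_cpp = enum.member(functools.partial(save_gguf))
else:
    # Backward-compatible fallback: on <= 3.10 a partial is treated
    # as a plain value and becomes a member on its own.
    class SAVE_FUNCTIONS(enum.Enum):
        llama_cpp = functools.partial(save_gguf)

print("llama_cpp" in SAVE_FUNCTIONS.__members__)  # True
print(SAVE_FUNCTIONS.llama_cpp.value("model", "out.gguf"))
```

The version guard mirrors the "backward-compatible fallback" mentioned in the commit message.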
…form usage
- standardized the LlamaCpp implementation and naming conventions within the engine
- implemented cache directory cleanup to prevent shutdown errors on Windows
- added a `save()` alias to the base model wrapper for improved API consistency
- updated the project configuration with a Llama.cpp dependency group
- benchmarked using SmolLM2-135M-Instruct with `q4_k_m` quantization

- added an `Int` class for integer-based configuration
- updated `get_device` and model checks for llama_cpp
- implemented secure conversion script caching
- enabled `TestLlamaCpp` and removed manual test overrides
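The `Int` class mentioned above is a bounded integer hyperparameter. A minimal sketch of what such a class might look like (the field names and `validate` method are assumptions, not the PR's actual implementation):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Int:
    """Minimal sketch of an integer hyperparameter with bounds."""
    name: str
    lower: int
    upper: int
    default_value: int

    def validate(self, value: int) -> int:
        # Reject non-integers and out-of-range values early, so a bad
        # config fails at smash-config time rather than deep in the engine.
        if not isinstance(value, int):
            raise TypeError(f"{self.name} must be an int, got {type(value).__name__}")
        if not (self.lower <= value <= self.upper):
            raise ValueError(f"{self.name}={value} outside [{self.lower}, {self.upper}]")
        return value

main_gpu = Int("main_gpu", lower=0, upper=16, default_value=0)
print(main_gpu.validate(1))  # 1
```

Allowing `lower=-1` in such a class is what would make llama.cpp's `-1` ("all layers") sentinel representable, as one reviewer suggests below.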
Up to standards ✅

🟢 Issues

| Metric | Results |
|---|---|
| Complexity | 49 |
| Duplication | 0 |
Hi @llcnt and @gsprochette! Here is an updated draft PR to replace #584.
@cursor review |
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 93fad34.
llcnt left a comment:
Thank you for the improved version of the PR!
We are definitely very close to the final step! :)
```python
processor_required: bool = False
dataset_required: bool = False
runs_on: list[str] = ["cpu", "cuda", "mps"]
compatible_before: list[str] = []
```
I think that the `reduce_noe` algo is compatible before!
```python
"n_gpu_layers",
sequence=[0, 1, 4, 8, 16, 32, 999],
default_value=0,
meta={"desc": "Number of layers to offload to GPU. Use 999 for all layers."},
```
Why use `999` here and not `-1` as in llamacpp? I guess you can define the `Int` to accept such a negative value, no?
```python
def _load_quantized_model(self, llama_cpp: Any, quant_gguf_path: Path, smash_config: Any, temp_dir: Path) -> Any:
    pruna_logger.info(f"Loading quantized model from {quant_gguf_path}")
    n_gpu_layers = smash_config["n_gpu_layers"]
    if n_gpu_layers == 999:
```
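The `999` sentinel checked above has to be translated to llama.cpp's own convention, where `-1` means "offload all layers". A hypothetical helper illustrating the mapping (the PR performs this check inline rather than in a named function):

```python
def normalize_n_gpu_layers(n_gpu_layers: int) -> int:
    """Map the config sentinel 999 to llama.cpp's -1 ('all layers').

    Hypothetical sketch: the PR does this inline in _load_quantized_model.
    """
    return -1 if n_gpu_layers == 999 else n_gpu_layers

print(normalize_n_gpu_layers(999))  # -1
print(normalize_n_gpu_layers(8))    # 8
```

If the `Int` hyperparameter accepted `-1` directly, as the reviewer suggests, this translation step would become unnecessary.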
| """Set the model to evaluation mode.""" | ||
| set_to_eval(self.model) | ||
|
|
||
| def save(self, model_path: str) -> None: |
Why do we need this alias?
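For background, a `save()` alias on a model wrapper is usually a thin delegation to the primary save method. A minimal sketch under stated assumptions (the wrapper class name and `save_pretrained` as the primary entry point are assumptions, not Pruna's actual API):

```python
class BaseModelWrapper:
    """Hypothetical minimal wrapper; Pruna's actual base class differs."""

    def save_pretrained(self, model_path: str) -> None:
        # Assumed primary save entry point; here it just records the path.
        self.saved_to = model_path

    def save(self, model_path: str) -> None:
        # Thin alias added for API consistency, so callers used to either
        # naming convention reach the same code path.
        self.save_pretrained(model_path)
```

The value of such an alias is purely ergonomic: both spellings dispatch to one implementation, so there is no behavior to keep in sync.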
```python
raise FileNotFoundError(f"GGUF file not found at {model_path}")

model = llama_cpp.Llama(model_path=str(model_path), **filter_load_kwargs(llama_cpp.Llama.__init__, kwargs))
model.model_path = str(model_path)
```
Same question as in the llama_cpp.py file :)
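The `filter_load_kwargs` call in the snippet above drops keyword arguments the target callable cannot accept. A sketch of what such a helper might do with `inspect.signature` (an assumption; the PR's real implementation may differ):

```python
import inspect
from typing import Any, Callable

def filter_load_kwargs(fn: Callable[..., Any], kwargs: dict[str, Any]) -> dict[str, Any]:
    """Keep only the kwargs that fn's signature accepts (sketch)."""
    params = inspect.signature(fn).parameters
    # If fn takes **kwargs, everything is acceptable as-is.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs)
    return {k: v for k, v in kwargs.items() if k in params}

def load(model_path: str, n_gpu_layers: int = 0) -> dict[str, Any]:
    """Hypothetical loader standing in for llama_cpp.Llama.__init__."""
    return {"model_path": model_path, "n_gpu_layers": n_gpu_layers}

print(filter_load_kwargs(load, {"n_gpu_layers": 8, "unknown_flag": True}))
# {'n_gpu_layers': 8}
```

Filtering this way keeps the load path tolerant of extra smash-config keys without forwarding unsupported arguments to `llama_cpp.Llama`.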
```python
n_gpu_layers=n_gpu_layers,
main_gpu=smash_config["main_gpu"],
)
quantized_model.model_path = str(quant_gguf_path)
```
```diff
- # if save-before-move was the last operation, we simply move the already saved files, we have dealt with them before
- elif smash_config.save_fns[-1] == SAVE_FUNCTIONS.save_before_apply.name:
+ elif len(smash_config.save_fns) > 0 and smash_config.save_fns[-1] == get_fn_name(SAVE_FUNCTIONS.save_before_apply):
```
Could we keep the comment just above for reference?

Description
This PR integrates the Llama.cpp quantizer engine into Pruna, enabling GGUF-based quantization. In addition to the new feature, this PR addresses critical compatibility issues for Python 3.13 and improves cross-platform robustness on Windows.
Key Changes:
- Added `llama-cpp-python` as a new quantizer backend, supporting various GGUF quantization methods (e.g., `q4_k_m`).
- Fixed a `KeyError` in `SAVE_FUNCTIONS` and `LOAD_FUNCTIONS` by explicitly using `enum.member()` for callable members (with a backward-compatible fallback for older Python versions).
- Resolved an `AttributeError` during interpreter shutdown on Windows.
- Added a `llamacpp` optional dependency group and updated the `full` extra in `pyproject.toml`.

Related Issue
Fixes #377
Related PRs
#583 - takes a more general look at the enum modification
Type of Change
How Has This Been Tested?
- Verified `Enum` member registration for engine save/load functions.
- Benchmarked `SmolLM2-135M-Instruct` using llama.cpp `q4_k_m` quantization.

Checklist
Additional Notes
The `TypeError` occasionally observed during `llama-cpp-python` shutdown is a known upstream issue in their `__del__` implementation during interpreter termination and does not affect the performance or correctness of the Smash/Save operations.
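The cache-directory cleanup mentioned in the commit messages follows the same theme: running cleanup via `atexit`, while the interpreter is still fully alive, sidesteps Windows errors caused by finalizers firing during teardown. A minimal sketch (the directory prefix and function names are illustrative, not the PR's actual code):

```python
import atexit
import shutil
import tempfile
from pathlib import Path

# Hypothetical cache location for converted GGUF artifacts.
cache_dir = Path(tempfile.mkdtemp(prefix="pruna_llamacpp_"))

def cleanup_cache() -> None:
    # Remove the cache eagerly; ignore_errors avoids raising on files
    # still held open by the OS (common on Windows).
    shutil.rmtree(cache_dir, ignore_errors=True)

# Registered hooks run before interpreter teardown, so modules like
# shutil are still importable and usable when cleanup happens.
atexit.register(cleanup_cache)
```

Relying on `__del__` for the same job is exactly what triggers the upstream shutdown `TypeError`, since module globals may already be torn down when object finalizers run.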