feat: integrate Llama.cpp and enhance engine stability for cross-platform usage#616
krishjp wants to merge 11 commits into PrunaAI:main
Conversation
…device checks for llama-cpp models due to a lack of `model.parameters()` support

…on 3.13
- addressed `functools.partial` object compatibility with Python 3.13
- integrated `enum.member()` in `SAVE_FUNCTIONS` and `LOAD_FUNCTIONS`
- updated the LlamaCpp algorithm implementation to use the standardized naming convention
- cleaned up redundant commented-out logic in the `save_pruna_model` function

Verified through restoration of LlamaCpp integration tests and diagnostic scripts confirming Enum member registration.
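The `enum.member()` change above addresses a real behavior shift: in Python 3.13, `functools.partial` became a descriptor, so a bare partial assigned in an `Enum` body is no longer treated as a member. A minimal sketch of the pattern (the member name and the stand-in callable here are illustrative, not the PR's actual entries):

```python
import enum
import functools
import sys

def save_gguf(model: str, path: str) -> str:
    """Stand-in save callable (hypothetical, for illustration)."""
    return f"saved {model} to {path}"

if sys.version_info >= (3, 11):
    # On Python 3.13, functools.partial implements the descriptor
    # protocol, so a bare partial in an Enum body would be silently
    # skipped as a member; enum.member() (added in 3.11) forces it
    # to be registered as a member.
    class SAVE_FUNCTIONS(enum.Enum):
        llama_cpp = enum.member(functools.partial(save_gguf))
else:
    # Backward-compatible fallback: on <= 3.10 a partial is treated
    # as a plain value and becomes a member on its own.
    class SAVE_FUNCTIONS(enum.Enum):
        llama_cpp = functools.partial(save_gguf)

print("llama_cpp" in SAVE_FUNCTIONS.__members__)  # True
print(SAVE_FUNCTIONS.llama_cpp.value("model", "out.gguf"))
```

The version guard mirrors the "backward-compatible fallback" mentioned in the commit message.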
…form usage
- standardized the LlamaCpp implementation and naming conventions within the engine
- implemented cache directory cleanup to prevent shutdown errors on Windows
- added a `save()` alias to the base model wrapper for improved API consistency
- updated the project configuration with a Llama.cpp dependency group
- benchmarked using SmolLM2-135M-Instruct with `q4_k_m` quantization

- added an `Int` class for integer-based configuration
- updated `get_device` and model checks for llama_cpp
- implemented secure conversion script caching
- enabled `TestLlamaCpp` and removed manual test overrides
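The `Int` class mentioned above is a bounded integer hyperparameter. A minimal sketch of what such a class might look like (the field names and `validate` method are assumptions, not the PR's actual implementation):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Int:
    """Minimal sketch of an integer hyperparameter with bounds."""
    name: str
    lower: int
    upper: int
    default_value: int

    def validate(self, value: int) -> int:
        # Reject non-integers and out-of-range values early, so a bad
        # config fails at smash-config time rather than deep in the engine.
        if not isinstance(value, int):
            raise TypeError(f"{self.name} must be an int, got {type(value).__name__}")
        if not (self.lower <= value <= self.upper):
            raise ValueError(f"{self.name}={value} outside [{self.lower}, {self.upper}]")
        return value

main_gpu = Int("main_gpu", lower=0, upper=16, default_value=0)
print(main_gpu.validate(1))  # 1
```

Allowing `lower=-1` in such a class is what would make llama.cpp's `-1` ("all layers") sentinel representable, as one reviewer suggests below.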
Up to standards ✅

🟢 Issues

| Metric | Results |
|---|---|
| Complexity | 49 |
| Duplication | 0 |
Hi @llcnt and @gsprochette! Here is an updated draft PR to replace #584.
@cursor review |
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 93fad34.
llcnt left a comment:
Thank you for the improved version of the PR!
We are definitely very close to the final step! :)
```python
processor_required: bool = False
dataset_required: bool = False
runs_on: list[str] = ["cpu", "cuda", "mps"]
compatible_before: list[str] = []
```
I think that the `reduce_noe` algo is compatible before!
```python
"n_gpu_layers",
sequence=[0, 1, 4, 8, 16, 32, 999],
default_value=0,
meta={"desc": "Number of layers to offload to GPU. Use 999 for all layers."},
```
Why use `999` here and not `-1` as in llamacpp? I guess you can define the `Int` to accept such a negative value, no?
```python
def _load_quantized_model(self, llama_cpp: Any, quant_gguf_path: Path, smash_config: Any, temp_dir: Path) -> Any:
    pruna_logger.info(f"Loading quantized model from {quant_gguf_path}")
    n_gpu_layers = smash_config["n_gpu_layers"]
    if n_gpu_layers == 999:
```
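The `999` sentinel checked above has to be translated to llama.cpp's own convention, where `-1` means "offload all layers". A hypothetical helper illustrating the mapping (the PR performs this check inline rather than in a named function):

```python
def normalize_n_gpu_layers(n_gpu_layers: int) -> int:
    """Map the config sentinel 999 to llama.cpp's -1 ('all layers').

    Hypothetical sketch: the PR does this inline in _load_quantized_model.
    """
    return -1 if n_gpu_layers == 999 else n_gpu_layers

print(normalize_n_gpu_layers(999))  # -1
print(normalize_n_gpu_layers(8))    # 8
```

If the `Int` hyperparameter accepted `-1` directly, as the reviewer suggests, this translation step would become unnecessary.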
| """Set the model to evaluation mode.""" | ||
| set_to_eval(self.model) | ||
|
|
||
| def save(self, model_path: str) -> None: |
Why do we need this alias?
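For background, a `save()` alias on a model wrapper is usually a thin delegation to the primary save method. A minimal sketch under stated assumptions (the wrapper class name and `save_pretrained` as the primary entry point are assumptions, not Pruna's actual API):

```python
class BaseModelWrapper:
    """Hypothetical minimal wrapper; Pruna's actual base class differs."""

    def save_pretrained(self, model_path: str) -> None:
        # Assumed primary save entry point; here it just records the path.
        self.saved_to = model_path

    def save(self, model_path: str) -> None:
        # Thin alias added for API consistency, so callers used to either
        # naming convention reach the same code path.
        self.save_pretrained(model_path)
```

The value of such an alias is purely ergonomic: both spellings dispatch to one implementation, so there is no behavior to keep in sync.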
```python
raise FileNotFoundError(f"GGUF file not found at {model_path}")

model = llama_cpp.Llama(model_path=str(model_path), **filter_load_kwargs(llama_cpp.Llama.__init__, kwargs))
model.model_path = str(model_path)
```
Same question as in the llama_cpp.py file :)
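The `filter_load_kwargs` call in the snippet above drops keyword arguments the target callable cannot accept. A sketch of what such a helper might do with `inspect.signature` (an assumption; the PR's real implementation may differ):

```python
import inspect
from typing import Any, Callable

def filter_load_kwargs(fn: Callable[..., Any], kwargs: dict[str, Any]) -> dict[str, Any]:
    """Keep only the kwargs that fn's signature accepts (sketch)."""
    params = inspect.signature(fn).parameters
    # If fn takes **kwargs, everything is acceptable as-is.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs)
    return {k: v for k, v in kwargs.items() if k in params}

def load(model_path: str, n_gpu_layers: int = 0) -> dict[str, Any]:
    """Hypothetical loader standing in for llama_cpp.Llama.__init__."""
    return {"model_path": model_path, "n_gpu_layers": n_gpu_layers}

print(filter_load_kwargs(load, {"n_gpu_layers": 8, "unknown_flag": True}))
# {'n_gpu_layers': 8}
```

Filtering this way keeps the load path tolerant of extra smash-config keys without forwarding unsupported arguments to `llama_cpp.Llama`.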
```python
n_gpu_layers=n_gpu_layers,
main_gpu=smash_config["main_gpu"],
)
quantized_model.model_path = str(quant_gguf_path)
```
```diff
- # if save-before-move was the last operation, we simply move the already saved files, we have dealt with them before
- elif smash_config.save_fns[-1] == SAVE_FUNCTIONS.save_before_apply.name:
+ elif len(smash_config.save_fns) > 0 and smash_config.save_fns[-1] == get_fn_name(SAVE_FUNCTIONS.save_before_apply):
```
Could we keep the comment just above for reference?

Description
This PR integrates the Llama.cpp quantizer engine into Pruna, enabling GGUF-based quantization. In addition to the new feature, this PR addresses critical compatibility issues for Python 3.13 and improves cross-platform robustness on Windows.
Key Changes:
- Added `llama-cpp-python` as a new quantizer backend, supporting various GGUF quantization methods (e.g., `q4_k_m`).
- Fixed a `KeyError` in `SAVE_FUNCTIONS` and `LOAD_FUNCTIONS` by explicitly using `enum.member()` for callable members (with a backward-compatible fallback for older Python versions).
- Resolved an `AttributeError` during interpreter shutdown on Windows.
- Added a `llamacpp` optional dependency group and updated the `full` extra in `pyproject.toml`.

Related Issue
Fixes #377
Related PRs
#583 - takes a more general look at the enum modification
Type of Change
How Has This Been Tested?
- Verified `Enum` member registration for engine save/load functions.
- Benchmarked `SmolLM2-135M-Instruct` using llama.cpp `q4_k_m` quantization.

Checklist
Additional Notes
The `TypeError` occasionally observed during `llama-cpp-python` shutdown is a known upstream issue in their `__del__` implementation during interpreter termination and does not affect the performance or correctness of the Smash/Save operations.
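The cache-directory cleanup mentioned in the commit messages follows the same theme: running cleanup via `atexit`, while the interpreter is still fully alive, sidesteps Windows errors caused by finalizers firing during teardown. A minimal sketch (the directory prefix and function names are illustrative, not the PR's actual code):

```python
import atexit
import shutil
import tempfile
from pathlib import Path

# Hypothetical cache location for converted GGUF artifacts.
cache_dir = Path(tempfile.mkdtemp(prefix="pruna_llamacpp_"))

def cleanup_cache() -> None:
    # Remove the cache eagerly; ignore_errors avoids raising on files
    # still held open by the OS (common on Windows).
    shutil.rmtree(cache_dir, ignore_errors=True)

# Registered hooks run before interpreter teardown, so modules like
# shutil are still importable and usable when cleanup happens.
atexit.register(cleanup_cache)
```

Relying on `__del__` for the same job is exactly what triggers the upstream shutdown `TypeError`, since module globals may already be torn down when object finalizers run.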