
UPSTREAM PR #1239: LoRA: improve LoCon support with other naming conventions#39

Open
loci-dev wants to merge 1 commit into master from upstream-PR1239-branch_stduhpf-locon

Conversation

@loci-dev

Mirrored from leejet/stable-diffusion.cpp#1239

LoRA "mid" weights for convolution layers were being ignored, which can cause crashes (stable-diffusion.cpp\lora.hpp:498: GGML_ASSERT(ggml_nelements(diff) == ggml_nelements(model_tensor)) failed) when loading some LoRA models.

This should fix it in most, if not all, cases.

Example model that fails before this change: https://civitai.green/models/918898/lokrconcept-ahetobleh-for-illustrious-based-models

@loci-dev loci-dev temporarily deployed to stable-diffusion-cpp-prod on January 30, 2026 at 16:46 with GitHub Actions
@loci-review

loci-review bot commented Jan 30, 2026

Overview

Analysis of 47,944 functions across two binaries reveals minimal performance impact from a single commit improving LoRA/LoCon naming convention support. Modified functions: 85 (0.18%), new: 53, removed: 32, unchanged: 47,774 (99.64%).

Power Consumption:

  • build.bin.sd-server: 502,849.71 nJ → 502,726.15 nJ (-0.025%)
  • build.bin.sd-cli: 469,847.60 nJ → 469,987.58 nJ (+0.03%)

Energy efficiency remains essentially unchanged, confirming negligible impact on production workloads.

Function Analysis

Most Significant Regressions:

  • std::vector<float>::begin() (sd-server): Response time +217% (+181ns: 83.36ns → 264.17ns), throughput time +289% (+181ns: 62.49ns → 243.30ns). Standard library function with no source changes; regression likely from compiler optimization differences or loss of inlining.

  • ggml_barrier (sd-cli): Response time +75% (+153ns: 203.13ns → 355.95ns), throughput time +82% (+153ns: 187.16ns → 339.98ns). Thread synchronization primitive in GGML CPU backend (external submodule). Most concerning regression as barriers are on critical path for parallel execution; source code not accessible.

  • std::_Hashtable::end() (sd-cli, cache): Response time +137% (+162ns: 118.61ns → 280.70ns), throughput time +195% (+162ns: 83.27ns → 245.36ns). Used in CacheDitConditionState for diffusion model caching; no source changes in cache_dit.hpp.

  • std::swap (sd-server, httplib): Response time +76% (+76ns: 99.96ns → 176.17ns), throughput time +104% (+76ns: 73.16ns → 149.37ns). Function pointer swap for HTTP callbacks; unrelated to LoRA changes.

Most Significant Improvements:

  • std::_Hashtable::end() (sd-server, sampler): Response time -58% (-162ns: 279.47ns → 117.39ns), throughput time -66% (-162ns: 245.36ns → 83.27ns). Used for sampler method lookups; compiler optimization improvement.

  • std::_Rb_tree::_M_insert_unique (sd-cli): Response time -5% (-91ns: 1762.53ns → 1671.91ns), throughput time -46% (-91ns: 196.87ns → 106.25ns). Red-black tree insertion showing substantial self-time optimization.

  • std::unordered_map::operator[] (sd-cli): Response time -1% (-63ns: 5564.19ns → 5500.92ns), throughput time -33% (-63ns: 194.13ns → 130.71ns). Used extensively in lora.hpp for tensor lookups; improvement aligns with LoRA naming convention changes, suggesting better compiler optimizations.

Source Code Context:

The single commit modified only lora.hpp, adding fallback logic for LoKr weight naming conventions in preprocess_lora_tensors(). This preprocessing logic executes during model initialization, not inference. Most performance variations occur in standard library functions with no source changes, indicating compiler/toolchain differences rather than code-driven regressions.

Other analyzed functions (validation, logging, constructors, string conversion) showed minor changes with negligible cumulative impact.

Additional Findings

Critical Path Assessment: No changes detected in inference hot paths (tensor operations, attention mechanisms, GPU kernels). Core ML workloads remain unaffected.

Primary Concern: The ggml_barrier regression (+82% throughput) warrants monitoring in multi-threaded CPU inference workloads, as synchronization overhead could accumulate across parallel operations. However, source code is in external GGML submodule and not attributable to stable-diffusion.cpp changes.

LoRA Impact: The naming convention improvements successfully enhance model compatibility. The 33% throughput improvement in std::unordered_map::operator[] suggests the code changes enabled better compiler optimizations for tensor name lookups during model loading.

Overall Assessment: The target version achieves functional objectives (enhanced LoRA compatibility) without measurable inference performance degradation. Performance variations are predominantly compiler-driven, with no systemic regressions in production-critical operations.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

@loci-dev loci-dev force-pushed the master branch 3 times, most recently from 0219cb4 to 17a1e1e Compare February 1, 2026 14:11
