
UPSTREAM PR #1239: LoRA: improve LoCon support with other naming conventions#39

Open
loci-dev wants to merge 1 commit into master from upstream-PR1239-branch_stduhpf-locon

Conversation

@loci-dev

Mirrored from leejet/stable-diffusion.cpp#1239

LoRA "mid" weights for convolution layers were being ignored, which can cause crashes (stable-diffusion.cpp\lora.hpp:498: GGML_ASSERT(ggml_nelements(diff) == ggml_nelements(model_tensor)) failed) when loading some LoRA models.

This should fix it in most, if not all, cases.

Example model that fails before this change: https://civitai.green/models/918898/lokrconcept-ahetobleh-for-illustrious-based-models

@loci-dev loci-dev temporarily deployed to stable-diffusion-cpp-prod on January 30, 2026 at 16:46 with GitHub Actions
@loci-review

loci-review bot commented Jan 30, 2026

Overview

Analysis of 47,944 functions across two binaries reveals minimal performance impact from a single commit improving LoRA/LoCon naming convention support. Modified functions: 85 (0.18%), new: 53, removed: 32, unchanged: 47,774 (99.64%).

Power Consumption:

  • build.bin.sd-server: 502,849.71 nJ → 502,726.15 nJ (-0.025%)
  • build.bin.sd-cli: 469,847.60 nJ → 469,987.58 nJ (+0.03%)

Energy efficiency remains essentially unchanged, confirming negligible impact on production workloads.

Function Analysis

Most Significant Regressions:

  • std::vector<float>::begin() (sd-server): Response time +217% (+181ns: 83.36ns → 264.17ns), throughput time +289% (+181ns: 62.49ns → 243.30ns). Standard library function with no source changes; regression likely from compiler optimization differences or loss of inlining.

  • ggml_barrier (sd-cli): Response time +75% (+153ns: 203.13ns → 355.95ns), throughput time +82% (+153ns: 187.16ns → 339.98ns). Thread synchronization primitive in GGML CPU backend (external submodule). Most concerning regression as barriers are on critical path for parallel execution; source code not accessible.

  • std::_Hashtable::end() (sd-cli, cache): Response time +137% (+162ns: 118.61ns → 280.70ns), throughput time +195% (+162ns: 83.27ns → 245.36ns). Used in CacheDitConditionState for diffusion model caching; no source changes in cache_dit.hpp.

  • std::swap (sd-server, httplib): Response time +76% (+76ns: 99.96ns → 176.17ns), throughput time +104% (+76ns: 73.16ns → 149.37ns). Function pointer swap for HTTP callbacks; unrelated to LoRA changes.

Most Significant Improvements:

  • std::_Hashtable::end() (sd-server, sampler): Response time -58% (-162ns: 279.47ns → 117.39ns), throughput time -66% (-162ns: 245.36ns → 83.27ns). Used for sampler method lookups; compiler optimization improvement.

  • std::_Rb_tree::_M_insert_unique (sd-cli): Response time -5% (-91ns: 1762.53ns → 1671.91ns), throughput time -46% (-91ns: 196.87ns → 106.25ns). Red-black tree insertion showing substantial self-time optimization.

  • std::unordered_map::operator[] (sd-cli): Response time -1% (-63ns: 5564.19ns → 5500.92ns), throughput time -33% (-63ns: 194.13ns → 130.71ns). Used extensively in lora.hpp for tensor lookups; improvement aligns with LoRA naming convention changes, suggesting better compiler optimizations.

Source Code Context:

The single commit modified only lora.hpp, adding fallback logic for LoKr weight naming conventions in preprocess_lora_tensors(). This preprocessing logic executes during model initialization, not inference. Most performance variations occur in standard library functions with no source changes, indicating compiler/toolchain differences rather than code-driven regressions.

Other analyzed functions (validation, logging, constructors, string conversion) showed minor changes with negligible cumulative impact.

Additional Findings

Critical Path Assessment: No changes detected in inference hot paths (tensor operations, attention mechanisms, GPU kernels). Core ML workloads remain unaffected.

Primary Concern: The ggml_barrier regression (+82% throughput) warrants monitoring in multi-threaded CPU inference workloads, as synchronization overhead could accumulate across parallel operations. However, source code is in external GGML submodule and not attributable to stable-diffusion.cpp changes.

LoRA Impact: The naming convention improvements successfully enhance model compatibility. The 33% throughput improvement in std::unordered_map::operator[] suggests the code changes enabled better compiler optimizations for tensor name lookups during model loading.

Overall Assessment: The target version achieves functional objectives (enhanced LoRA compatibility) without measurable inference performance degradation. Performance variations are predominantly compiler-driven, with no systemic regressions in production-critical operations.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

@loci-dev loci-dev force-pushed the master branch 3 times, most recently from 0219cb4 to 17a1e1e Compare February 1, 2026 14:11
