Skip to content

UPSTREAM PR #1247: sd-server: set cfg_scale in the guidance parameters#49

Open
loci-dev wants to merge 1 commit intomainfrom
loci/pr-1247-cfg_scale_pr
Open

UPSTREAM PR #1247: sd-server: set cfg_scale in the guidance parameters#49
loci-dev wants to merge 1 commit intomainfrom
loci/pr-1247-cfg_scale_pr

Conversation

@loci-dev
Copy link

@loci-dev loci-dev commented Feb 3, 2026

Note

Source pull request: leejet/stable-diffusion.cpp#1247

What:
sd_xl_turbo_1.0_fp16 generates low-quality images.

Why:
cfg_scale is not being passed along from sd-server to stable-diffusion
sd_xl_turbo_1.0_fp16 requires setting cfg_scale to 1.0 for good result

How:
This update passes along cfg_scale from server's json request.value to stable-diffusion's gen_params.sample_params.guidance.txt_cfg parameter.

@loci-review
Copy link

loci-review bot commented Feb 3, 2026

Overview

Analysis of 48,154 functions across build.bin.sd-server and build.bin.sd-cli reveals net positive performance with 131 modified, 60 new, and 132 removed functions. Power consumption improved by -1.745% (512,977 nJ → 504,025 nJ) for build.bin.sd-server and -1.866% (479,167 nJ → 470,226 nJ) for build.bin.sd-cli. The single commit between versions modified guidance parameter handling (cfg_scale), unrelated to observed performance changes.

Function Analysis

ggml_compute_forward_flash_attn_ext_f16 (both binaries) shows exceptional improvement: response time decreased -47% (25,524 ns → 13,500 ns for server, 25,565 ns → 13,527 ns for CLI), saving ~12,000 ns per call. This performance-critical flash attention function is called hundreds of times per image generation across text encoding and denoising steps. The optimization originates from ggml submodule updates, not application code changes.

Multiple STL functions show significant regressions: __iter_equals_val (+237% response time, +185 ns), end() methods (+228% response time, +183 ns), and _S_key (+164% response time, +187 ns). These standard library functions experienced 200-300% throughput time increases likely due to compiler/toolchain differences rather than source code modifications. While percentages are high, absolute impacts are small (150-200 ns per call) and occur in non-critical paths.

path_str functions show +225% response time (+3,360 ns) but affect only initialization (backend registration), not inference hot paths. _M_destroy for Conv2d shows +180% throughput time (+189 ns) affecting object cleanup. make_shared instantiations show +113% throughput time but only +8-9% response time, indicating allocation overhead increases with minimal downstream impact.

Additional Findings

The flash attention optimization dominates the performance profile, providing 7-15 milliseconds improvement per image generation. This ML-critical function benefits text encoding (CLIP/T5) and all attention operations in U-Net/DiT/Flux architectures across 20-50 diffusion steps. The 47% speedup in this hot path far outweighs cumulative STL regressions (~20-50 microseconds in initialization, ~10-20 microseconds in inference), resulting in measurable end-to-end inference acceleration and reduced power consumption. The optimization scales with model complexity and resolution, providing greater benefits for larger models and higher-resolution generation.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

@loci-dev loci-dev force-pushed the main branch 10 times, most recently from 76645dd to 5bbc590 Compare February 7, 2026 04:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant