UPSTREAM PR #1124: feat: support for cancelling generations #44

Open
loci-dev wants to merge 1 commit into main from loci/pr-1124-sd_cancel

loci-dev commented Feb 2, 2026

Note

Source pull request: leejet/stable-diffusion.cpp#1124

Adds an sd_cancel_generation function that can be called asynchronously to interrupt the current generation.

The log handling is still a bit rough around the edges, but I wanted to gather more feedback before polishing it. I've included a flag to allow finer control over what to cancel: either everything, or only the current and subsequent generations while keeping and decoding the already-generated latents. Would an extra "finish the already started latent but cancel the batch" mode be useful? Or should I simplify it instead, keeping just the cancel-everything mode?

The function should be safe to call from the progress or preview callbacks, a separate thread, or a signal handler. I've included a Unix signal handler in main.cpp just to be able to test it: the first Ctrl+C cancels the batch and the current generation but still finishes the already generated latents, while a second Ctrl+C cancels everything (although it no longer interrupts a generation step midway).
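
To make the two-stage behaviour concrete, here is a minimal sketch of an atomic, escalate-only cancel flag with a SIGINT handler. It is not the PR's code: `CancelMode`, `request_cancel`, `on_sigint`, and `run_steps` are hypothetical names chosen for illustration, and the real `sd_cancel_generation` signature may differ.

```cpp
#include <atomic>
#include <csignal>

// Hypothetical cancel levels, ordered by severity (not the PR's actual enum).
enum CancelMode : int {
    CANCEL_NONE         = 0,  // keep generating
    CANCEL_KEEP_LATENTS = 1,  // stop new work, still decode finished latents
    CANCEL_ALL          = 2,  // drop everything
};

// std::atomic<int> is lock-free on mainstream platforms, which is what makes
// it usable from a signal handler as well as from other threads or callbacks.
static std::atomic<int> g_cancel{CANCEL_NONE};

// Escalate-only request: a weaker request never downgrades a stronger one.
void request_cancel(CancelMode mode) {
    int cur = g_cancel.load();
    while (cur < mode && !g_cancel.compare_exchange_weak(cur, mode)) {
        // On failure `cur` is refreshed with the current value; retry.
    }
}

// First Ctrl+C keeps finished latents, a second one cancels everything.
extern "C" void on_sigint(int) {
    int cur = g_cancel.load();
    request_cancel(cur == CANCEL_NONE ? CANCEL_KEEP_LATENTS : CANCEL_ALL);
}

// Stand-in for a sampling loop: the flag is polled once per step, so a
// request takes effect at the next step boundary.
// Returns the number of steps that actually ran.
template <typename StepFn>
int run_steps(int total_steps, StepFn on_step) {
    int done = 0;
    for (int i = 0; i < total_steps; ++i) {
        if (g_cancel.load() != CANCEL_NONE) {
            break;
        }
        on_step(i);
        ++done;
    }
    return done;
}
```

In a CLI, the handler would be installed with `std::signal(SIGINT, on_sigint)`. Because the flag is only polled between steps in this sketch, a cancel request never interrupts a step that is already running.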

fixes #1036

loci-review bot commented Feb 2, 2026

Overview

Analysis of 47,950 functions across two binaries reveals minimal net performance impact between versions. Modified functions: 75 (0.16%), new: 59, removed: 31, unchanged: 47,785 (99.66%).

Power Consumption:

  • build.bin.sd-cli: 469,513.77 nJ (base: 469,680.15 nJ, -0.035%)
  • build.bin.sd-server: 502,702.24 nJ (base: 502,761.30 nJ, -0.012%)

Both binaries show negligible power consumption changes (under 0.04% in each case), so the regressions and improvements listed below roughly cancel out.
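
The percentages above are plain relative changes against the base measurement. A small helper (the name `percent_change` is illustrative, not part of any tool mentioned here) reproduces them from the reported nJ values:

```cpp
// Relative change of `current` against `base`, in percent.
// A negative result means `current` is lower than `base`.
double percent_change(double base, double current) {
    return (current - base) / base * 100.0;
}
```

Calling `percent_change(469680.15, 469513.77)` gives roughly -0.035 for sd-cli, and `percent_change(502761.30, 502702.24)` gives roughly -0.012 for sd-server, matching the figures in the list.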

Function Analysis

Most performance variations occur in C++ Standard Library functions and external GGML library code rather than application code. The primary code change—adding atomic-based cancellation support—has minimal direct performance impact.

Significant Regressions:

  • std::_Rb_tree::end() (build.bin.sd-cli): Response time +183.29 ns (+227.95%), throughput time +183.29 ns (+306.60%). STL function regression likely from compiler optimization differences.
  • ggml_view_2d (build.bin.sd-cli): Throughput time +32.12 ns (+47.00%), response time +24.03 ns (+1.16%). Critical tensor reshaping operation used extensively in attention mechanisms.
  • gguf_writer::write (build.bin.sd-cli): Throughput time +85.70 ns (+43.14%), response time +88.51 ns (+0.61%). Affects model serialization, not inference hot paths.
  • ggml_vec_scale_f16 (build.bin.sd-cli): Throughput time +76.95 ns (+8.66%), response time +76.99 ns (+5.62%). SIMD vector scaling operation in inference path.

Significant Improvements:

  • std::_Rb_tree::_M_insert_unique() (build.bin.sd-cli): Throughput time -90.67 ns (-46.04%), response time -91.62 ns (-5.20%). STL red-black tree insertion optimization.
  • std::unordered_map::operator[] (build.bin.sd-cli): Throughput time -62.72 ns (-32.31%), response time -64.65 ns (-1.16%). Benefits LoRA state management operations.
  • std::map::operator[] (build.bin.sd-cli): Throughput time -61.63 ns (-28.93%), response time -62.83 ns (-1.47%). Improves parameter lookup operations.

Other analyzed functions showed minor changes in STL container operations, quantization validation, and memory management, with absolute impacts under 50 ns per call.

Additional Findings

ML tensor operations show modest cumulative regressions. The combination of ggml_view_2d (+32 ns), ggml_vec_scale_f16 (+77 ns), and apply_unary_op (+71 ns) adds approximately 180 ns of overhead per operation set. For diffusion models with multiple attention layers and denoising steps, this could accumulate to low milliseconds per generation. However, improvements in LoRA state management and container operations partially offset these regressions. All of these changes originate from the external GGML library or from compiler optimizations rather than from application code modifications.
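
As a rough sanity check on the "low milliseconds" estimate, the per-set overhead can be scaled by a workload size. The sketch below uses a hypothetical helper, `added_ms`, and the workload numbers in the note after it are placeholders, not measurements from this report.

```cpp
// Total added time in milliseconds, given the overhead per operation set
// (in nanoseconds), the number of operation sets per denoising step, and
// the number of denoising steps.
double added_ms(double ns_per_set, long sets_per_step, long steps) {
    return ns_per_set * sets_per_step * steps / 1e6;
}
```

For example, assuming 500 operation sets per step and 20 steps, `added_ms(180.0, 500, 20)` evaluates to 1.8 ms. Put differently, it takes about 5,600 operation sets at 180 ns each to add a single millisecond.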

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

loci-dev force-pushed the main branch 22 times, most recently from a234621 to d762b55 on February 5, 2026 04:41
