UPSTREAM PR #1184: Feat: Select backend devices via arg (#40)
Conversation
Force-pushed from 052ebb0 to 76ede2c: fix sdxl conditioner backends; fix sd3 backend display.
Force-pushed from 29e8399 to 2d43513.
Overview

Analysis of stable-diffusion.cpp across 18 commits reveals minimal performance impact from the multi-backend device management refactoring. Of 48,425 total functions, 124 were modified (0.26%), 331 added, and 109 removed. Power consumption increased negligibly: build.bin.sd-cli (+0.388%, 479,167 → 481,028 nJ) and build.bin.sd-server (+0.239%, 512,977 → 514,202 nJ).

Function Analysis

- SDContextParams constructor (both binaries): response time increased ~40% (+2,816-2,840 ns) due to initializing 9 new string members. One-time setup cost outside inference paths.
- SDContextParams destructor (both binaries): response time increased ~42% (+2,497-2,505 ns) from destroying 9 additional string members. One-time cleanup cost outside inference paths.
- ~StableDiffusionGGML (both binaries): throughput time increased ~95% (+192 ns absolute) managing 7 backend types versus 3, including loop-based cleanup for multiple CLIP backends. Response time impact minimal (+5.2%, ~720 ns).
- ggml_e8m0_to_fp32_half (sd-cli): response time improved 24% (-36 ns), benefiting quantization operations called millions of times during inference.
- Standard library functions (std::_Rb_tree::begin, std::vector::_S_max_size, std::swap): showed 76-289% throughput increases due to template instantiation complexity, but absolute changes remain under 220 ns in non-critical initialization paths.

Additional Findings

All performance regressions occur in initialization and cleanup phases, not inference hot paths. The architectural changes enable multi-GPU workload distribution, per-component device placement (diffusion, CLIP, VAE on separate devices), and runtime backend flexibility. Quantization improvements and multi-GPU capabilities provide net performance gains during actual inference, far exceeding the microsecond-level initialization overhead. Changes are well-justified architectural improvements with negligible real-world impact.

🔎 Full breakdown: Loci Inspector.
Force-pushed from 76645dd to 5bbc590.
Note: Source pull request: leejet/stable-diffusion.cpp#1184
The main goal of this PR is to improve the user experience in multi-GPU setups, allowing users to choose which model component gets sent to which device.
CLI changes:

- `--main-backend-device [device_name]` argument to set the default backend.
- `--clip-on-cpu`, `--vae-on-cpu` and `--control-net-cpu` arguments replaced by `--clip_backend_device [device_name]`, `--vae-backend-device [device_name]` and `--control-net-backend-device [device_name]` arguments.
- `--diffusion_backend_device` (controls the device used for the diffusion/flow models) and `--tae-backend-device`.
- `--upscaler-backend-device`, `--photomaker-backend-device`, and `--vision-backend-device`.
- `--list-devices` argument to print the list of available ggml devices and exit.
- `--rpc` argument to connect to a compatible GGML RPC server.

C API changes (stable-diffusion.h):
- New fields in the `sd_ctx_params_t` struct.
- `void list_backends_to_buffer(char* buffer, size_t buffer_size)` writes the details of the available backends to a null-terminated char array. Devices are separated by newline characters (`\n`), and the name and description of each device are separated by a `\t` character.
- `size_t backend_list_size()` gets the size of the buffer needed for `list_backends_to_buffer`.
- `void add_rpc_device(const char* address);` connects to a ggml RPC backend (from llama.cpp).

The default device selection should now consistently prioritize discrete GPUs over iGPUs.
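To make the buffer format concrete, here is a minimal sketch (not part of the PR) of how a caller might enumerate the available devices, assuming only the entry points described above; the RPC address is a placeholder, error handling is elided, and whether `backend_list_size()` already includes the trailing null terminator is an assumption.

```cpp
#include <cstdio>
#include <cstdlib>
#include <cstring>

#include "stable-diffusion.h"

int main() {
    // Hypothetical usage: register a remote ggml RPC backend first so it
    // shows up in the device list (the address below is a placeholder).
    // add_rpc_device("192.168.1.42:50052");

    // Two-step pattern: query the required buffer size, then fill it.
    size_t size = backend_list_size();
    char* buffer = (char*)malloc(size);
    list_backends_to_buffer(buffer, size);

    // Devices are separated by '\n'; within each line, the device name
    // and its description are separated by '\t'.
    for (char* line = strtok(buffer, "\n"); line != nullptr;
         line = strtok(nullptr, "\n")) {
        char* tab = strchr(line, '\t');
        if (tab != nullptr) {
            *tab = '\0';  // split the line into name and description
            printf("device: %-16s description: %s\n", line, tab + 1);
        }
    }
    free(buffer);
    return 0;
}
```

The size-then-fill pattern keeps allocation on the caller's side, which is presumably why the two functions come as a pair.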
For example, if you want to run the text encoders on the CPU, you'd now need to use `--clip_backend_device CPU` instead of `--clip-on-cpu`.

TODO:
- `--lora-apply-mode immediately` when clip and diffusion models are running on different (non-cpu) backends.

Important: to use RPC, you need to add `-DGGML_RPC=ON` to the build. Additionally, it requires either sd.cpp to be built with the `-DSD_USE_SYSTEM_GGML` flag (I haven't tested that one), or the RPC server to be built with `-DCMAKE_C_FLAGS="-DGGML_MAX_NAME=128" -DCMAKE_CXX_FLAGS="-DGGML_MAX_NAME=128"` (the default is 64).

Fixes #1116