perf: --eager-load-params for fast steady-state streaming#1646

Closed
fszontagh wants to merge 2 commits into
leejet:masterfrom
fszontagh:perf/eager-load-params
@fszontagh

Contributor

Summary

After #1644 centralized weight staging, params are loaded from disk to the params backend lazily on the first prepare_params call. For multi-segment streaming on a large model this means the first sampling step pays the entire disk-read cost (8-15 seconds per segment on Z-Image bf16), and batch images re-pay it whenever runner_done() releases the params storage.

This PR adds a sd_ctx_params_t::eager_load_params flag (CLI: --eager-load-params) that loads every registered tensor into the params backend right after metadata validation. Default off, so the lazy behavior is preserved for users who want lower peak host RAM at model-load time.

Numbers

RTX 3060 12 GB, --offload-to-cpu --stream-layers --max-vram -1:

| Workload | Default (lazy) | --eager-load-params |
| --- | --- | --- |
| SDXL bf16 1152x896 batch=2 8 steps generate_image | 21 s | 17 s |
| Z-Image bf16 1024x688 batch=2 9 steps generate_image | 359 s | 58 s |

For long-lived processes (servers, batch generation) the eager path also reduces total wall-clock time, because images 2..N reuse the warm pinned-host cache instead of re-reading the model from disk.

Implementation

  • ModelManager::load_all_params_eagerly() collects all registered states and calls the existing load_tensors_to_params_backend.
  • Plumbed through sd_ctx_params_t::eager_load_params, init, and to_str.
  • CLI flag added in examples/common.
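
To make the lazy/eager distinction concrete, here is a minimal, self-contained sketch. It is a toy model, not the code in this PR: `ToyTensor` and `ToyModelManager` are hypothetical names, and "loading" is reduced to flipping a flag and counting simulated disk reads. Only the method names `prepare_params` and `load_all_params_eagerly` are borrowed from the description above; their signatures here are assumptions.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-in for a registered tensor whose bytes stay on disk until loaded.
struct ToyTensor {
    std::string name;
    bool loaded = false;
};

// Hypothetical, simplified manager (NOT the real ModelManager).
class ToyModelManager {
public:
    void register_tensor(const std::string& name) {
        tensors_.push_back(ToyTensor{name});
    }

    // Lazy path: load only the requested tensors that are not resident yet.
    // Returns how many tensors had to be read from "disk" during this call.
    std::size_t prepare_params(const std::vector<std::string>& names) {
        std::size_t cold = 0;
        for (const std::string& wanted : names) {
            for (ToyTensor& t : tensors_) {
                if (t.name == wanted && !t.loaded) {
                    t.loaded = true;  // simulated disk read
                    ++cold;
                    ++disk_reads_;
                }
            }
        }
        return cold;
    }

    // Eager path: collect every registered tensor and push it through the
    // same loader, so later prepare_params calls find everything resident.
    std::size_t load_all_params_eagerly() {
        std::vector<std::string> all;
        for (const ToyTensor& t : tensors_) {
            all.push_back(t.name);
        }
        return prepare_params(all);
    }

    std::size_t disk_reads() const { return disk_reads_; }

private:
    std::vector<ToyTensor> tensors_;
    std::size_t disk_reads_ = 0;
};
```

In this toy model, calling `load_all_params_eagerly()` once after registration makes every later `prepare_params` call return 0 cold loads, which is the effect the flag is meant to have on the first sampling step.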

@leejet

leejet commented Jun 14, 2026

Owner

Closing this as obsolete.

The original issue this PR tried to address no longer applies to the current default behavior. Params storage is kept after first use in the default params backend, so later runs can reuse the loaded weights. Storage is only released for tensors with Disk residency, which corresponds to --params-backend disk, and that reload/release behavior is expected for that mode.

So I don't think we need a separate --eager-load-params option anymore. Thanks for the contribution!

@leejet leejet closed this Jun 14, 2026
@fszontagh

fszontagh commented Jun 14, 2026

Contributor Author

@leejet thanks for taking a look. After the merge with #1644 and the later refactors I tested again and you're correct that storage is kept after first use - release_params_storage_blocks only releases blocks where residency_mode == Disk, so batch image 2 reuses what image 1 already loaded.
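
The release rule described above can be sketched as follows. This is an illustrative model only: `ToyBlock`, `release_disk_blocks`, and the `Resident` enumerator are hypothetical, and the only detail taken from the real code is that blocks are released when `residency_mode == Disk`.

```cpp
#include <cstddef>
#include <vector>

// Only `Disk` is named in the discussion; `Resident` is an assumed
// placeholder for "any mode whose storage is kept after first use".
enum class ResidencyMode { Resident, Disk };

// Toy stand-in for a params storage block.
struct ToyBlock {
    ResidencyMode residency_mode;
    bool has_storage;
};

// Drops storage only for Disk-residency blocks and leaves all others
// untouched. Returns the number of blocks that were released.
std::size_t release_disk_blocks(std::vector<ToyBlock>& blocks) {
    std::size_t released = 0;
    for (ToyBlock& b : blocks) {
        if (b.residency_mode == ResidencyMode::Disk && b.has_storage) {
            b.has_storage = false;
            ++released;
        }
    }
    return released;
}
```

Under this rule a block that is not Disk-resident keeps its storage across images, which matches batch image 2 reusing what image 1 loaded.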

But the first use is still lazy. With --stream-layers, every merged segment's first prepare_params call triggers load_tensors_to_params_backend for tensors it hasn't touched yet, and that work lands inside the first sampling step.

Per-call probe inside prepare_params on the current merged branch, Z-Image bf16 1024x688 batch=2 2 steps, RTX 3060 12 GB:

| stage | calls | load time per call | total |
| --- | --- | --- | --- |
| image 1 step 1 cold loads | ~36 | one ~30s, one ~40s, ~32 small ~3s each | ~180s |
| image 1 step 2+ | all | 0 ms | 0 |
| image 2 entirely | all | 0 ms | 0 |
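
For anyone who wants to reproduce this kind of measurement, a per-call probe can be as small as the sketch below. It is not the instrumentation used for the numbers above: `LoadProbe` is a hypothetical helper, and the callable passed to `measure` stands in for whatever load work happens inside `prepare_params`.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: records the wall-clock duration of each load call.
struct LoadProbe {
    std::vector<double> ms_per_call;

    // Runs `load` once and appends its duration in milliseconds.
    template <typename Fn>
    void measure(Fn&& load) {
        const auto start = std::chrono::steady_clock::now();
        load();
        const auto end = std::chrono::steady_clock::now();
        ms_per_call.push_back(
            std::chrono::duration<double, std::milli>(end - start).count());
    }

    std::size_t calls() const { return ms_per_call.size(); }

    // Sum of all recorded durations in milliseconds.
    double total_ms() const {
        double sum = 0.0;
        for (double ms : ms_per_call) {
            sum += ms;
        }
        return sum;
    }
};
```

Wrapping each load in `probe.measure([&] { /* load work */ });` and printing `ms_per_call` after every sampling step is enough to separate cold loads in step 1 from the zero-cost calls afterwards.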

So storage is kept (great), but the cold disk-read cost is paid in the first sampling step. For users on --stream-layers (the perf-sensitive path that motivated #1612, #1601, #1611) this is the slow first step that's hard to avoid without a hook to load up front.

With --eager-load-params the same disk reads happen at model load instead, and step 1 becomes the same speed as step 2+:

| Workload | default (lazy) | --eager-load-params |
| --- | --- | --- |
| Z-Image bf16 1024x688 batch=2 9 steps generate_image | 244 s | 63 s |
| SDXL bf16 1152x896 batch=2 8 steps generate_image | 21 s | 17 s |

If you'd prefer a different shape (e.g. always-on when `stream_layers && params_backend is cpu`, or an `--params-backend cpu-eager` token instead of a new flag), happy to rework. Or if you think paying the lazy first-load is the right default and users should accept it for the smaller startup RAM, I'll close.
