LLM benchmarks: run weekly LLM benchmarks from website-managed models by bradleyshep · Pull Request #5324 · clockworklabs/SpacetimeDB

bradleyshep · 2026-06-15T14:05:12Z

Note 1: this requires a website PR to merge

Note 2:

I ran all workflow smoke tests successfully, including golden validation and dry-run benchmarks, with one exception: the C# dry-run benchmark path. C# golden validation passes, but the C# benchmark dry run still fails on the runner (intermittently on some runs, consistently on others) despite several attempts to align its build/publish setup with the known-good smoketest path.

gh workflow run llm-benchmark-periodic.yml `
  --repo ClockworkLabs/SpacetimeDB `
  --ref bradley/fix-validate-goldens-ci `
  -f model_set=explicit `
  -f models="openrouter:openai/gpt-5.4-mini" `
  -f languages=rust,csharp,typescript `
  -f modes=guidelines `
  -f tasks=t_000_empty_reducers `
  -f dry_run=true
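
The `models` input above uses a `vendor:model` route string (`openrouter:openai/gpt-5.4-mini`). As a rough illustration of how such a route could be split, here is a minimal sketch. `ModelRoute` and `parse_model_route` are hypothetical names, not the runner's actual API, and splitting at the first colon is an assumption inferred from the example value.

```rust
/// A model route split into its vendor prefix and vendor-specific model id.
/// Hypothetical type for illustration only.
#[derive(Debug, PartialEq, Eq)]
pub struct ModelRoute {
    pub vendor: String,
    pub model: String,
}

/// Splits `vendor:model` at the first colon (assumed format).
/// Returns `None` when there is no colon or either side is empty.
pub fn parse_model_route(spec: &str) -> Option<ModelRoute> {
    let (vendor, model) = spec.trim().split_once(':')?;
    if vendor.is_empty() || model.is_empty() {
        return None;
    }
    Some(ModelRoute {
        // Vendor names are compared case-insensitively in this sketch.
        vendor: vendor.to_ascii_lowercase(),
        // The model id may itself contain slashes, so it is kept verbatim.
        model: model.to_string(),
    })
}
```

With this sketch, `parse_model_route("openrouter:openai/gpt-5.4-mini")` yields vendor `openrouter` and model `openai/gpt-5.4-mini`, while a bare `gpt-5.4-mini` is rejected.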

Description of Changes

This updates the LLM benchmark automation and runner plumbing.

- Move periodic LLM benchmark and golden validation workflows from daily/nightly to weekly Monday UTC runs.
- Add manual workflow inputs for benchmark smoke runs:
  - model set: website-managed, local defaults, or explicit models
  - languages, modes, categories, tasks
  - dry-run mode
- Build the local TypeScript SDK before TypeScript benchmark/golden validation runs.
- Add support for fetching active/available benchmark models from the website API via --model-source remote.
- Keep explicit --models ... working for manual/local overrides.
- Add OpenRouter preflight checks before benchmark execution:
  - checks key/account credits when available
  - probes the selected model when credit balance cannot be checked
  - supports OPENROUTER_ALLOW_UNCHECKED_CREDITS=1 escape hatch
  - supports OPENROUTER_MIN_CREDITS / LLM_MIN_CREDITS
- Force scheduled benchmark workflow runs through OpenRouter with LLM_VENDOR=openrouter, while preserving direct OpenAI support for local/manual use.
- Improve benchmark publishing isolation:
  - isolated SpacetimeDB CLI root per publish
  - serialized C# benchmark publish concurrency
  - local NuGet package references for generated C# benchmark projects
  - Windows/PATH handling for TypeScript pnpm
- Update default benchmark model routes to current model names/ids.
- Update TypeScript golden answers for current SDK shape.
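
The OpenRouter preflight behavior listed above can be read as a small decision table. The sketch below is one possible interpretation, not the code in this PR. The `Preflight` type and the function names are hypothetical. Letting `OPENROUTER_MIN_CREDITS` take precedence over `LLM_MIN_CREDITS`, defaulting the minimum to 0, and having the escape hatch skip the probe are all assumptions.

```rust
/// Outcome of the preflight decision (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Preflight {
    /// Credit balance was readable and meets the minimum.
    Proceed,
    /// Credit balance could not be read: probe the selected model instead.
    ProbeModel,
    /// Credit balance could not be read, but the escape hatch is set.
    ProceedUnchecked,
    /// Credit balance was readable and is below the minimum.
    Abort { balance: f64, minimum: f64 },
}

/// Reads the minimum credit threshold through an env-like lookup.
/// Assumption: OPENROUTER_MIN_CREDITS is tried first, then LLM_MIN_CREDITS;
/// unset or unparseable values are skipped and the default is 0.0.
pub fn min_credits(env: impl Fn(&str) -> Option<String>) -> f64 {
    ["OPENROUTER_MIN_CREDITS", "LLM_MIN_CREDITS"]
        .iter()
        .filter_map(|key| env(*key))
        .filter_map(|raw| raw.trim().parse::<f64>().ok())
        .next()
        .unwrap_or(0.0)
}

/// True when OPENROUTER_ALLOW_UNCHECKED_CREDITS is set to "1".
pub fn unchecked_allowed(env: impl Fn(&str) -> Option<String>) -> bool {
    env("OPENROUTER_ALLOW_UNCHECKED_CREDITS").map_or(false, |v| v.trim() == "1")
}

/// Decides what to do before a benchmark run.
/// `balance` is `None` when the credit endpoint could not be queried.
pub fn decide(balance: Option<f64>, minimum: f64, allow_unchecked: bool) -> Preflight {
    match balance {
        Some(b) if b >= minimum => Preflight::Proceed,
        Some(b) => Preflight::Abort { balance: b, minimum },
        None if allow_unchecked => Preflight::ProceedUnchecked,
        None => Preflight::ProbeModel,
    }
}
```

Passing the environment as a closure keeps the decision logic testable without mutating process-wide variables. In a real binary the lookup would likely be `|k| std::env::var(k).ok()`.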

API and ABI breaking changes

None.

This adds benchmark-runner/workflow behavior and CLI options, but does not change SpacetimeDB runtime API or ABI.

Expected complexity level and risk

3/5

The changes are mostly isolated to the LLM benchmark runner and GitHub workflows, but the risk is moderate because they touch CI execution paths, local SDK build assumptions, website-managed model resolution, OpenRouter routing, and generated module publish behavior across Rust, C#, and TypeScript.

The most sensitive pieces are:

- GitHub Actions workflow dispatch/manual input behavior.
- Remote model registry parsing from the website.
- C# benchmark publish behavior on the self-hosted runner.
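
For the C# publish path, the description mentions serializing publish concurrency. One common way to get that effect is a process-wide lock taken only for C# publishes. The sketch below shows that pattern under stated assumptions: `Lang`, `publish_guard`, and `publish_with` are hypothetical names, and the PR's actual mechanism may differ.

```rust
use std::sync::{Mutex, MutexGuard};

/// Benchmark target languages named in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Lang {
    Rust,
    CSharp,
    TypeScript,
}

/// Process-wide lock held for the duration of a C# publish (assumed design).
static CSHARP_PUBLISH_LOCK: Mutex<()> = Mutex::new(());

/// Returns a guard for C# publishes and `None` for other languages,
/// so Rust and TypeScript publishes stay concurrent.
pub fn publish_guard(lang: Lang) -> Option<MutexGuard<'static, ()>> {
    match lang {
        // A poisoned lock only means an earlier publish panicked;
        // the guard is still safe to reuse because it protects no data.
        Lang::CSharp => Some(CSHARP_PUBLISH_LOCK.lock().unwrap_or_else(|e| e.into_inner())),
        Lang::Rust | Lang::TypeScript => None,
    }
}

/// Runs `publish` while holding the C# lock when required.
pub fn publish_with<T>(lang: Lang, publish: impl FnOnce() -> T) -> T {
    let _guard = publish_guard(lang);
    publish()
}
```

A lock like this only serializes publishes within one runner process. If several processes share a runner, a file lock or a workflow-level concurrency group would be needed instead.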

Testing

- cargo check -p xtask-llm-benchmark --bin llm_benchmark
- cargo test -p xtask-llm-benchmark --bin llm_benchmark
- cargo test -p xtask-llm-benchmark parses_active_available_model_routes
- Manual GitHub Actions golden validation smoke runs for Rust, C#, and TypeScript.
- Run a dry-run periodic benchmark workflow from this branch with one explicit OpenRouter model, one task, and all languages.
- Run a website-dispatched dry-run benchmark and verify it sends model_set=explicit plus selected model/task inputs.
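
To make the model-set behavior exercised by these tests more concrete, here is a minimal sketch of how the three model sets (website-managed, local defaults, explicit) might resolve to a list of routes. `RemoteModel`, `ModelSet`, and `resolve_routes` are hypothetical names. The registry shape is assumed, and so is the rule that a website model must be both active and available.

```rust
use std::collections::HashSet;

/// One entry from the website's model registry (assumed shape).
#[derive(Debug, Clone)]
pub struct RemoteModel {
    pub route: String,
    pub active: bool,
    pub available: bool,
}

/// Where the model list comes from, mirroring the three model-set choices.
pub enum ModelSet {
    Website(Vec<RemoteModel>),
    LocalDefaults,
    Explicit(Vec<String>),
}

/// Resolves a model set to an ordered, de-duplicated list of routes.
/// Returns an error when nothing is left to benchmark.
pub fn resolve_routes(set: ModelSet, local_defaults: &[&str]) -> Result<Vec<String>, String> {
    let mut routes: Vec<String> = match set {
        // Assumption: a website model is used only when it is both active and available.
        ModelSet::Website(models) => models
            .into_iter()
            .filter(|m| m.active && m.available)
            .map(|m| m.route)
            .collect(),
        ModelSet::LocalDefaults => local_defaults.iter().map(|s| s.to_string()).collect(),
        ModelSet::Explicit(routes) => routes,
    };
    // Drop blank entries, then duplicates, keeping first-seen order.
    routes.retain(|r| !r.trim().is_empty());
    let mut seen = HashSet::new();
    routes.retain(|r| seen.insert(r.clone()));
    if routes.is_empty() {
        return Err("no benchmark models selected".to_string());
    }
    Ok(routes)
}
```

Failing early on an empty list means a registry outage or a mistyped input shows up as a clear error rather than as a benchmark run that silently does nothing.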

bradleyshep added 25 commits June 10, 2026 15:05

updates

a6382ca

Update provider.rs

711ff88

updates

e82f0ae

preflight credit checks; workflow update to use web

bcdb41d

weekly goldens; workflow refinements

f2179a2

Update publishers.rs

8d1d27e

golden fixes

d5957f2

Merge branch 'master' into bradley/fix-validate-goldens-ci

f1ae445

fixes

4c679e2

Update publishers.rs

4358ed5

updates

890be18

Update publishers.rs

480cedf

fixes

d4999e2

Update publishers.rs

e58523f

fixes

032afd1

Merge branch 'master' into bradley/fix-validate-goldens-ci

9eee265

match smoketest (fingers crossed?)

6037418

fix

2e6e02f

shrug

b2308b1

fix?

2b133b8

testing

7857671

test

ee38f7a

Update llm-benchmark-periodic.yml

77e2924

revert tests

63a9c34

preflight no error; vendor to openrouter in periodic

9596077

bradleyshep requested review from bfops, cloutiertyler and jdetter as code owners June 15, 2026 14:05

bradleyshep added 2 commits June 15, 2026 10:11

lints

65e4539

Merge branch 'master' into bradley/fix-validate-goldens-ci

4272cfd

bradleyshep wants to merge 27 commits into master from bradley/fix-validate-goldens-ci
