LLM benchmarks: run weekly LLM benchmarks from website-managed models #5324

Open
bradleyshep wants to merge 27 commits into master from bradley/fix-validate-goldens-ci

@bradleyshep

Contributor

Note 1: This PR depends on a website PR that must also be merged.

Note 2:

I was able to run all workflow smoke tests successfully, including golden validation and dry-run benchmarks, except for the C# dry-run benchmark path. C# golden validation passes, but the C# benchmark dry run still fails on the runner, whether intermittently or consistently, despite several attempts to align its build/publish setup with the known-good smoketest path.

gh workflow run llm-benchmark-periodic.yml `
  --repo ClockworkLabs/SpacetimeDB `
  --ref bradley/fix-validate-goldens-ci `
  -f model_set=explicit `
  -f models="openrouter:openai/gpt-5.4-mini" `
  -f languages=rust,csharp,typescript `
  -f modes=guidelines `
  -f tasks=t_000_empty_reducers `
  -f dry_run=true

Description of Changes

This updates the LLM benchmark automation and runner plumbing.

  • Move periodic LLM benchmark and golden validation workflows from daily/nightly to weekly Monday UTC runs.
  • Add manual workflow inputs for benchmark smoke runs:
    • model set: website-managed, local defaults, or explicit models
    • languages, modes, categories, tasks
    • dry-run mode
  • Build the local TypeScript SDK before TypeScript benchmark/golden validation runs.
  • Add support for fetching active/available benchmark models from the website API via --model-source remote.
  • Keep explicit --models ... working for manual/local overrides.
  • Add OpenRouter preflight checks before benchmark execution:
    • checks key/account credits when available
    • probes the selected model when credit balance cannot be checked
    • supports OPENROUTER_ALLOW_UNCHECKED_CREDITS=1 escape hatch
    • supports OPENROUTER_MIN_CREDITS / LLM_MIN_CREDITS
  • Force scheduled benchmark workflow runs through OpenRouter with LLM_VENDOR=openrouter, while preserving direct OpenAI support for local/manual use.
  • Improve benchmark publishing isolation:
    • isolated SpacetimeDB CLI root per publish
    • serialized C# benchmark publish concurrency
    • local NuGet package references for generated C# benchmark projects
    • Windows/PATH handling for TypeScript pnpm
  • Update default benchmark model routes to current model names/ids.
  • Update TypeScript golden answers for current SDK shape.
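
For reviewers, the OpenRouter preflight bullets above can be read as roughly the following decision logic. This is a minimal sketch, not the code in this PR: the `Preflight` type and the `min_credits` and `decide` functions are hypothetical names, and three behaviors are assumptions made for illustration (OPENROUTER_MIN_CREDITS taking precedence over LLM_MIN_CREDITS, a default threshold of 0.0, and the escape hatch skipping the model probe).

```rust
/// Outcome of the preflight decision (hypothetical type, not from the PR).
#[derive(Debug, PartialEq)]
enum Preflight {
    /// Safe to start the benchmark run.
    Proceed,
    /// Balance unknown: send a small probe request to the selected model first.
    ProbeModel,
    /// Stop before spending runner time; the string explains why.
    Abort(String),
}

/// Resolve the minimum credit threshold from an env-style lookup.
/// Assumption: OPENROUTER_MIN_CREDITS wins over LLM_MIN_CREDITS, and
/// unset or unparsable values fall through to a default of 0.0.
fn min_credits(lookup: impl Fn(&str) -> Option<String>) -> f64 {
    ["OPENROUTER_MIN_CREDITS", "LLM_MIN_CREDITS"]
        .iter()
        .find_map(|name| lookup(*name)?.trim().parse::<f64>().ok())
        .unwrap_or(0.0)
}

/// Decide what to do given the credit balance (None when it cannot be checked).
/// Assumption: OPENROUTER_ALLOW_UNCHECKED_CREDITS=1 maps to `allow_unchecked`
/// and lets the run proceed without probing the model.
fn decide(balance: Option<f64>, min: f64, allow_unchecked: bool) -> Preflight {
    match balance {
        Some(b) if b >= min => Preflight::Proceed,
        Some(b) => Preflight::Abort(format!("credits {b} below minimum {min}")),
        None if allow_unchecked => Preflight::Proceed,
        None => Preflight::ProbeModel,
    }
}
```

In a real binary, `min_credits` would likely be called as `min_credits(|k| std::env::var(k).ok())`. Taking the lookup as a closure keeps the logic testable without mutating process environment variables.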

API and ABI breaking changes

None.

This adds benchmark-runner/workflow behavior and CLI options, but does not change SpacetimeDB runtime API or ABI.

Expected complexity level and risk

3/5

The changes are mostly isolated to the LLM benchmark runner and GitHub workflows, but the risk is moderate because they touch CI execution paths, local SDK build assumptions, website-managed model resolution, OpenRouter routing, and generated module publish behavior across Rust, C#, and TypeScript.

The most sensitive pieces are:

  • GitHub Actions workflow dispatch/manual input behavior.
  • Remote model registry parsing from the website.
  • C# benchmark publish behavior on the self-hosted runner.
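
Since remote model registry parsing is one of the sensitive pieces, here is a minimal sketch of the kind of filtering it implies. It is an illustration under stated assumptions, not the implementation in this PR: `RemoteModel`, `ModelRoute`, `parse_route`, and `runnable_routes` are hypothetical names, the `active`/`available` fields are a guessed shape for the website response, and the sketch assumes a model must be both active and available to run. The `vendor:model` route form is taken from the `openrouter:openai/gpt-5.4-mini` example in the smoke-test command.

```rust
/// One entry in the website's model list (hypothetical shape, not from the PR).
#[derive(Debug, Clone)]
struct RemoteModel {
    /// Route string such as "openrouter:openai/gpt-5.4-mini".
    route: String,
    active: bool,
    available: bool,
}

/// A route split into its vendor prefix and model id.
#[derive(Debug, PartialEq)]
struct ModelRoute {
    vendor: String,
    model: String,
}

/// Split "vendor:model" at the first colon; reject routes missing either half.
fn parse_route(raw: &str) -> Option<ModelRoute> {
    let (vendor, model) = raw.trim().split_once(':')?;
    if vendor.is_empty() || model.is_empty() {
        return None;
    }
    Some(ModelRoute {
        vendor: vendor.to_ascii_lowercase(),
        model: model.to_string(),
    })
}

/// Keep only models that are active AND available (an assumption), and
/// silently drop entries whose route does not parse.
fn runnable_routes(models: &[RemoteModel]) -> Vec<ModelRoute> {
    models
        .iter()
        .filter(|m| m.active && m.available)
        .filter_map(|m| parse_route(&m.route))
        .collect()
}
```

Dropping malformed entries rather than failing the whole run is a design choice in this sketch; a scheduled workflow might instead prefer to fail loudly so a bad registry response is noticed.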

Testing

  • cargo check -p xtask-llm-benchmark --bin llm_benchmark
  • cargo test -p xtask-llm-benchmark --bin llm_benchmark
  • cargo test -p xtask-llm-benchmark parses_active_available_model_routes
  • Manual GitHub Actions golden validation smoke runs for Rust, C#, and TypeScript.
  • Run a dry-run periodic benchmark workflow from this branch with one explicit OpenRouter model, one task, and all languages.
  • Run a website-dispatched dry-run benchmark and verify it sends model_set=explicit plus selected model/task inputs.
