# Benchmarking & Performance
This page is a living reference for measuring SML deployments. Contributions of methodology, scripts, or write-ups are welcome — see Contributing a benchmark at the bottom.
## What to measure
Always report at least:
| Metric | Why it matters |
|---|---|
| TTFT (time-to-first-token) | User-visible latency; dominated by prefill. |
| Tokens / sec / replica | Throughput ceiling per replica. |
| Tokens / sec / GPU | Hardware efficiency; lets you compare layouts. |
| P50 / P95 / P99 latency | Tail behavior under load. |
| Concurrent requests | The offered load at which the other numbers were measured. |
A throughput number without the concurrency it was measured at is meaningless — always pair them.
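To make "always pair them" concrete, below is a minimal load-probe sketch. It assumes the replica (or gateway) exposes an OpenAI-compatible streaming completions endpoint, which vLLM and sglang's OpenAI-compatible server both provide; `BASE_URL`, `MODEL`, the prompt, and the one-token-per-chunk approximation are placeholders to adapt, not SML specifics.

```python
# Minimal TTFT / throughput probe. Assumes an OpenAI-compatible streaming
# completions endpoint (as served by vLLM and sglang's OpenAI-compatible
# server); BASE_URL, MODEL, and the prompt are placeholders for your setup.
import asyncio
import time

import httpx

BASE_URL = "http://localhost:8000"   # framework host:port (or the gateway)
MODEL = "my-model"                   # placeholder model name
CONCURRENCY = 8                      # report numbers *with* this value

async def one_request(client: httpx.AsyncClient) -> tuple[float, int, float]:
    """Returns (ttft_seconds, completion_tokens, total_seconds)."""
    payload = {"model": MODEL,
               "prompt": "Explain KV caching in one paragraph.",
               "max_tokens": 128, "stream": True}
    start = time.perf_counter()
    ttft, tokens = None, 0
    async with client.stream("POST", f"{BASE_URL}/v1/completions",
                             json=payload, timeout=120) as resp:
        resp.raise_for_status()
        async for line in resp.aiter_lines():
            if not line.startswith("data: ") or line == "data: [DONE]":
                continue
            if ttft is None:
                ttft = time.perf_counter() - start   # first token back
            tokens += 1   # rough: one SSE chunk ~ one completion token
    if ttft is None:                  # no tokens came back at all
        ttft = time.perf_counter() - start
    return ttft, tokens, time.perf_counter() - start

async def main() -> None:
    async with httpx.AsyncClient() as client:
        results = await asyncio.gather(
            *(one_request(client) for _ in range(CONCURRENCY)))
    ttfts = sorted(r[0] for r in results)
    total_tokens = sum(r[1] for r in results)
    wall = max(r[2] for r in results)
    print(f"concurrency={CONCURRENCY}  "
          f"p50 TTFT={ttfts[len(ttfts) // 2]:.3f}s  "
          f"tokens/sec={total_tokens / wall:.1f}")

if __name__ == "__main__":
    asyncio.run(main())
```

Run it once per concurrency level you care about, and report the concurrency alongside every number it prints.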
## Best practices
- Warm up first. Discard the first ~30 seconds of measurements: the weights cache, NCCL channels, and the KV cache all need to settle (see the trimming sketch after this list).
- Use a realistic workload. Synthetic 100-token-in / 100-token-out benchmarks rarely match production. Capture or replay a real prompt distribution.
- Vary one thing at a time. Replicas × precision × batch size × context length is a 4D space; sweep one axis with the others fixed.
- Pin the framework version. Both sglang and vLLM iterate fast — record exact image tag / git SHA in your write-up.
- Match the partition’s nodes. Performance on `normal` vs. a debug partition can differ; benchmarks should target the partition users will use.
- Disable OpenTela for raw numbers. Pass `--disable-ocf` to skip the OpenTela mesh registration on each replica (the flag name is historical — `OCF` and `OpenTela` are the same thing). You then drive load directly to the framework’s host:port, which gives you the framework’s true throughput. Leave it on for end-to-end numbers that include the mesh + gateway hop. See When to disable OCF.
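The trimming sketch referenced in the first bullet: given per-request samples from whatever load generator you use, drop the warm-up window and summarize the tail. The `samples` shape and the helper name are illustrative, not part of any SML tooling.

```python
# Trim the warm-up window and summarize tail latencies. Assumes `samples` is
# a list of (request_start_unix_seconds, latency_seconds) tuples collected by
# your load generator; the 30 s window matches the warm-up guidance above.
import statistics

WARMUP_SECONDS = 30.0

def summarize(samples: list[tuple[float, float]]) -> dict[str, float]:
    t0 = min(t for t, _ in samples)
    steady = sorted(lat for t, lat in samples if t - t0 >= WARMUP_SECONDS)
    if len(steady) < 2:
        raise ValueError("not enough samples outside the warm-up window")
    q = statistics.quantiles(steady, n=100)   # q[k-1] is the k-th percentile
    return {"p50": q[49], "p95": q[94], "p99": q[98],
            "n": float(len(steady))}
```

Note that `statistics.quantiles` interpolates, so a P99 from a small sample is noisy; collect enough steady-state requests that the tail buckets have real data behind them.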
## Observability
- Grafana — aggregated dashboards for SML jobs (and Kubernetes) are available on the metrics page. Note that you must be on the VPN or the internal network to view the metrics dashboard.
- Framework — vLLM has metrics enabled by default; for sglang they must be enabled by passing `--enable-metrics` as a framework arg.
- DCGM exporter — per-GPU metrics (SM utilization, memory bandwidth, NVLink, power). DCGM runs alongside the inference framework on each replica node; metrics are scraped to the same Grafana stack. Disable via `--disable-dcgm-exporter` if needed.
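For a quick look without Grafana, you can read the framework's Prometheus endpoint directly. A sketch, assuming the framework serves text-format metrics at `/metrics` on its serving port (vLLM does by default; sglang once `--enable-metrics` is passed); the address and the substring filters are placeholders, and exact metric names vary by framework and version.

```python
# Dump a few Prometheus metrics straight from the framework, bypassing
# Grafana. URL is a placeholder for the framework's serving address; the
# substring filters below happen to match vLLM's metric names, but check
# the full /metrics output for your framework and version.
import urllib.request

URL = "http://localhost:8000/metrics"   # placeholder framework address
INTERESTING = ("num_requests", "time_to_first_token", "generation_tokens")

with urllib.request.urlopen(URL, timeout=10) as resp:
    for line in resp.read().decode().splitlines():
        if line.startswith("#"):
            continue   # skip HELP/TYPE comment lines
        if any(key in line for key in INTERESTING):
            print(line)
```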
## Pre-canned methodology
More methodologies welcome — open a PR adding a section here.
(Placeholder — add a “How we measured X” subsection per benchmark you publish, with the exact `sml advanced` invocation, the load generator, and the resulting numbers.)
## Contributing a benchmark
If you’ve run a serious benchmark and want it preserved here:
- Open a PR adding a new `## ...` section to this file (or, for big write-ups, a sibling file like `docs/benchmarks/<topic>.md` linked from here).
- Include: model, framework version, GPU layout, exact `sml` invocation, load generator + workload, raw numbers, brief discussion.
- If a chart helps, drop the source PNG / SVG under `docs/assets/` and link it.
The goal is a small, browseable library of “we tried X, got Y” so the next person doesn’t redo the same experiment.
## Next
- How to size a model — pick the layout you’re benchmarking
- Architecture — what’s actually in the request path