How to Size a Model for a GPU / Node¶
Picking the right replica count and nodes-per-replica boils down to two questions: does the model fit in VRAM, and how much spare VRAM do I want for the KV cache?
Step 1 — VRAM the weights need¶
Rough formula: `weights_vram ≈ n_params × bytes_per_param`, where bytes per parameter depend on precision:
| Precision | Bytes / param | Example: 70B model |
|---|---|---|
| FP32 | 4 | 280 GB |
| BF16/FP16 | 2 | 140 GB |
| FP8 | 1 | 70 GB |
| INT4 | 0.5 | 35 GB |
Add ~20% overhead for activations, framework buffers, and CUDA workspaces.
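The arithmetic above fits in a few lines of Python. A minimal sketch; the function name and the flat 20% overhead factor are illustrative, not measured constants:

```python
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "fp16": 2.0, "fp8": 1.0, "int4": 0.5}

def weights_vram_gb(n_params_b: float, precision: str, overhead: float = 0.20) -> float:
    """Estimate weight VRAM in GB: params x bytes/param, plus ~20% for
    activations, framework buffers, and CUDA workspaces."""
    raw_gb = n_params_b * BYTES_PER_PARAM[precision]  # 1 B params x 1 byte = 1 GB
    return raw_gb * (1 + overhead)

print(weights_vram_gb(70, "bf16"))  # 168.0 GB -> too big for one 96 GB GPU
```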
Step 2 — VRAM the KV cache needs¶
KV cache scales with concurrent sequences and context length. For a transformer:

`kv_bytes_per_token = 2 × n_layers × n_kv_heads × head_dim × bytes_per_element`

(the factor 2 covers the separate K and V tensors). Then:

`kv_cache_vram ≈ kv_bytes_per_token × max_concurrent_tokens`

Where `max_concurrent_tokens` is roughly `max_batch × max_seq_len`. If you’re not sure, start by reserving 30–50% of VRAM for the KV cache — both sglang and vLLM size their cache to fill what’s left after weights.
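The same estimate in Python, a sketch using Llama-3-70B-like shapes as assumed example values (80 layers, 8 KV heads under GQA, head dim 128, BF16 cache):

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, bytes_per_elem,
                max_batch, max_seq_len):
    """Estimate KV cache VRAM in GB: 2 tensors (K and V) per layer per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * max_batch * max_seq_len / 1e9

# Assumed example shapes (Llama-3-70B-like, GQA): 80 layers, 8 KV heads, head_dim 128
print(kv_cache_gb(80, 8, 128, 2, max_batch=64, max_seq_len=8192))  # ~172 GB
```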
Step 3 — pick a GPU layout¶
CSCS GH200 nodes have 4 GPUs at ~96 GB each (~384 GB per node).
| Model size (BF16) | Fits where | Layout |
|---|---|---|
| ≤ 30 B | 1 GPU | --slurm-replicas N --slurm-nodes-per-replica 1, set framework --tp-size 1 |
| 30–80 B | 1 node (4-way TP) | 1 replica per node, framework --tp-size 4 |
| 80–250 B | 1 node (4-way TP) at FP8, or 2 nodes BF16 | quantize, or --slurm-nodes-per-replica 2 + matching TP |
| 250 B+ | Multiple nodes | --slurm-nodes-per-replica 2+, expect tensor + pipeline parallelism |
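Putting steps 1–3 together, a sketch of the layout decision. The 96 GB GPU and 4-GPU node figures are the GH200 numbers from above; the thresholds simply mirror the table:

```python
import math

def pick_layout(weights_gb: float, gpu_gb: float = 96, gpus_per_node: int = 4):
    """Map estimated weight VRAM (incl. overhead) to a TP/node layout.
    Mirrors the table above; KV-cache headroom still needs its own check."""
    node_gb = gpu_gb * gpus_per_node
    if weights_gb <= gpu_gb:                    # fits on one GPU
        return {"nodes_per_replica": 1, "tp_size": 1}
    if weights_gb <= node_gb:                   # fits on one node with 4-way TP
        return {"nodes_per_replica": 1, "tp_size": gpus_per_node}
    nodes = math.ceil(weights_gb / node_gb)     # spans nodes: TP + PP territory
    return {"nodes_per_replica": nodes, "tp_size": gpus_per_node}

print(pick_layout(168))  # 70B BF16 (+20% overhead) -> 1 node, tp_size=4
```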
Parallelism: DP / TP / PP / EP — and why DP is replicas¶
Four flavors of parallelism show up when serving large models:
| Term | What it splits across GPUs | Where SML expresses it |
|---|---|---|
| TP (tensor parallelism) | A single matmul, sharded across GPUs within a layer | Framework flag (e.g. sglang/vLLM --tp-size) inside --framework-args. Stays inside one replica. |
| PP (pipeline parallelism) | Layers, sharded across GPUs (or nodes) end-to-end | Framework flag (e.g. --pp-size) inside --framework-args. Spans nodes within one replica when --slurm-nodes-per-replica > 1. |
| EP (expert parallelism) | MoE experts, sharded across GPUs — only meaningful for MoE models | Framework flag (e.g. vLLM/sglang --ep-size or --enable-expert-parallel) inside --framework-args. Stays inside one replica. |
| DP (data parallelism) | Independent copies serving different requests in parallel | --slurm-replicas N — N copies of the model, optionally fronted by --use-router. |
In short: a “replica” in SML is a DP unit. TP, PP, and EP are framework-internal — they affect how one replica is laid out across its allocated GPUs/nodes. DP is just “how many replicas”.
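To make the mapping concrete, here is a hypothetical launch assembled in Python. The flags are the ones documented above; the `sml` entrypoint name and the exact quoting of `--framework-args` are assumptions, so check your installation's docs:

```python
import shlex

# Hypothetical launch: 2 DP replicas, each laid out as 4-way TP on one node.
# "sml" is an assumed entrypoint name; the flags are the ones documented above.
cmd = [
    "sml",
    "--slurm-replicas", "2",            # DP: two independent copies
    "--slurm-nodes-per-replica", "1",   # each replica fits on one node
    "--use-router",                     # load-balance across the replicas
    "--framework-args", "--tp-size 4",  # TP stays inside each replica
]
print(shlex.join(cmd))
```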
A note on dense models in Kubernetes¶
For dense models (one weight matrix per layer, no MoE routing), DP isn’t usually expressed inside the inference framework — you don’t tell the framework “give me 4 data-parallel copies on these 4 GPUs”. You just request a single GPU per replica and let the autoscaler add more replicas when load grows. The orchestrator (k8s, or here, SLURM + --slurm-replicas) provides DP naturally; the framework only handles TP (and PP when needed).
This shapes the rule below: bump --slurm-replicas for throughput, not the framework’s DP flags.
MoE models change the picture¶
For Mixture-of-Experts models (Mixtral, DeepSeek-V3, GLM-4.5/5, Qwen-MoE, …), the choice between TP and EP matters:
- TP shards each expert’s weight matrices across GPUs. Communication is on the critical path of every token.
- EP keeps each expert whole on one GPU and routes tokens to the GPU that owns the expert they were assigned to. Communication is one all-to-all per MoE layer, but per-expert matmuls stay local.
Rule of thumb: for MoE models with many experts and modest expert size, prefer EP over TP within a replica — it’s typically faster on multi-GPU nodes. Use TP for the dense (attention) parts and EP for the MoE feed-forward parts when the framework supports it (most modern serving stacks do).
DP across replicas still applies the same way for throughput: more concurrent requests → bump --slurm-replicas.
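A quick worked example of the EP arithmetic, using Mixtral-8x7B-like figures (8 experts per MoE layer) purely for illustration:

```python
def experts_per_gpu(n_experts: int, ep_size: int) -> float:
    """With EP, each GPU owns n_experts / ep_size whole experts; the only
    MoE communication is one all-to-all per layer to route tokens."""
    return n_experts / ep_size

# Mixtral-8x7B-like: 8 experts per MoE layer, 4-GPU node
print(experts_per_gpu(8, ep_size=4))  # 2.0 whole experts per GPU
# With TP=4 instead, every expert's matmuls would be sharded 4 ways,
# putting an all-reduce on the critical path of every token.
```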
Step 4 — replicas vs. nodes-per-replica¶
These two flags set very different things:
- `--slurm-replicas N` — N independent copies of the model. Use for throughput: more concurrent requests, optionally fronted by `--use-router` for load balancing.
- `--slurm-nodes-per-replica K` — each replica spans K nodes. Use when one replica doesn’t fit on a single node (large models, long context, more KV cache).
Total nodes = replicas × nodes-per-replica.
Rule of thumb:
- Model fits on 1 node, want more throughput? → bump `--slurm-replicas`.
- Model doesn’t fit on 1 node? → bump `--slurm-nodes-per-replica` first, then add replicas if you still need throughput.
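Putting the two flags together (plain arithmetic, nothing assumed beyond the flags above):

```python
def total_nodes(replicas: int, nodes_per_replica: int) -> int:
    """Total SLURM nodes = --slurm-replicas x --slurm-nodes-per-replica."""
    return replicas * nodes_per_replica

print(total_nodes(replicas=4, nodes_per_replica=1))  # 70B BF16, 4x throughput -> 4 nodes
print(total_nodes(replicas=2, nodes_per_replica=2))  # 250B+ BF16 -> also 4 nodes, different shape
```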
Step 5 — sanity-check before submitting¶
- Time limit (`--slurm-time`) covers warm-up + your workload + a margin. Cold start of a multi-node deployment can sometimes take up to 40 minutes (e.g. Kimi-k2.5, 1T params).
- Partition matches the GPU layout you’re asking for.
- KV cache leaves room for your max sequence length × max batch.
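A hedged end-to-end check that ties steps 1–2 to the node budget. The 384 GB figure is the GH200 node total from above; the numbers in the example echo the earlier sketches:

```python
def fits(weights_gb: float, kv_gb: float, nodes_per_replica: int = 1,
         node_gb: float = 384) -> bool:
    """True if weights + KV cache fit in the replica's total VRAM."""
    return weights_gb + kv_gb <= nodes_per_replica * node_gb

# 70B BF16 (+20% overhead) with a ~172 GB KV cache budget on one GH200 node:
print(fits(weights_gb=168, kv_gb=172))  # True: 340 GB <= 384 GB
```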
Latency tuning¶
Use this when a single user is waiting for a response — chat, interactive demos, copilot-style autocomplete. The metrics to optimize are TTFT (time to first token) and per-token latency at low concurrency.
| Knob | Recommended for low latency |
|---|---|
| Model size | The smallest model that meets your quality bar. A well-tuned 8B is faster than a clumsily-tuned 70B. |
| Precision | FP8 or INT4 if accuracy holds. Less VRAM read per token = faster. |
| Replicas | 1. More replicas help throughput, not single-request latency. |
| Router | Off (--use-router not set). The router adds a hop. |
| Framework batching | Keep --max-num-seqs low (e.g. 8) so requests don’t queue behind a giant batch. |
| Context length | Cap --max-model-len to what you actually need. Smaller KV cache = faster prefill. |
| TP | Just enough to fit the model. Past that, TP communication starts costing more than it saves. |
| OCF | If you’re driving load directly from another job on the cluster, --disable-ocf removes the mesh hop. For end-user traffic via the public gateway, keep it on. |
Measure TTFT and P50/P99 at concurrency = 1 and concurrency = your realistic ceiling — they will tell different stories. See Benchmarking.
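A rough TTFT probe, assuming an OpenAI-compatible streaming endpoint (both sglang and vLLM expose one); the URL and model name below are placeholders:

```python
import time, json, urllib.request

def ttft_seconds(base_url: str, model: str, prompt: str) -> float:
    """Time from request send to the first streamed chunk (approximates TTFT)."""
    body = json.dumps({"model": model, "prompt": prompt,
                       "max_tokens": 64, "stream": True}).encode()
    req = urllib.request.Request(f"{base_url}/v1/completions", data=body,
                                 headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        resp.readline()  # first SSE chunk carries the first token
        return time.perf_counter() - start

print(ttft_seconds("http://localhost:8000", "my-model", "Hello"))  # placeholders
```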
Throughput tuning¶
Use this when you have a lot of work to push through — batch eval, dataset processing, offline scoring. The metric to optimize is tokens/sec aggregated across all requests.
| Knob | Recommended for high throughput |
|---|---|
| Replicas | More. Bump --slurm-replicas until you hit a partition or budget cap. DP scales linearly. |
| Router | On (--use-router). Spreads load across replicas; without it you have to load-balance yourself. |
| Framework batching | Crank --max-num-seqs (e.g. 256+) so the framework can group requests into fat batches. |
| KV cache headroom | Leave more VRAM for the cache. Bigger cache = more concurrent sequences = more batching opportunity. |
| Precision | FP8 if quality allows — smaller weights leave more room for KV cache and increase batch size. |
| Context length | Cap --max-model-len to the longest request you’ll actually send. Wasted KV cache = lost batch slots. |
| Concurrency at the client | Don’t ramp slower than the server can absorb — keep ≥ replicas × max-num-seqs requests in flight. |
If you’re benchmarking, disable OCF to take the mesh hop out of the measurement (see When to disable OCF).
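A sketch of the client-side rule from the table: keep at least replicas × max-num-seqs requests in flight. The `send_request` coroutine is a placeholder for your actual HTTP call:

```python
import asyncio

async def send_request(i: int) -> None:
    """Placeholder: issue one inference request and await the response."""
    await asyncio.sleep(0.1)  # stand-in for the real HTTP call

async def drive(total: int, replicas: int, max_num_seqs: int) -> None:
    # Keep >= replicas * max_num_seqs requests in flight at all times.
    sem = asyncio.Semaphore(replicas * max_num_seqs)

    async def one(i: int) -> None:
        async with sem:
            await send_request(i)

    await asyncio.gather(*(one(i) for i in range(total)))

asyncio.run(drive(total=10_000, replicas=4, max_num_seqs=256))  # 1024 in flight
```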
When in doubt¶
Start with one replica on one node at the lowest precision your accuracy budget tolerates. Measure (see Benchmarking). Scale from there.
Next¶
- Benchmarking — measure before scaling
- Advanced Usage — the flags above in context