ClankerBuilder

Tok/s Methodology

How we estimate decode throughput for local LLM inference · Last updated Jun 9, 2026

15% of top-20 GPU ratings carry HIGH confidence (3 HIGH ratings in total, out of 24 materialized).

Price data

Prices are sourced via the sovrn aggregator adapter. Offers older than 48 hours are marked STALE until the next cron refresh.
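
The staleness rule above amounts to a simple age check. The sketch below is illustrative only, not the site's actual code: the function name offer_status, the fetched_at timestamp, and the "FRESH" label are hypothetical. Only the 48-hour threshold and the STALE marker come from the text.

```python
from datetime import datetime, timedelta, timezone

# Threshold stated in the text; everything else here is an assumed sketch.
STALE_AFTER = timedelta(hours=48)


def offer_status(fetched_at: datetime, now: datetime) -> str:
    """Return "STALE" for offers older than 48 hours, otherwise "FRESH"."""
    age = now - fetched_at
    return "STALE" if age > STALE_AFTER else "FRESH"


# Hypothetical timestamps, for illustration only.
now = datetime(2026, 6, 9, 12, 0, tzinfo=timezone.utc)
print(offer_status(now - timedelta(hours=49), now))  # STALE
print(offer_status(now - timedelta(hours=3), now))   # FRESH
```

An offer that is exactly 48 hours old is treated as fresh here, because the text says "older than 48 hours". That boundary choice is an assumption.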

What tok/s measures

Tokens per second (tok/s) is the decode throughput of an LLM runtime: how fast the model generates output tokens once the prompt has been processed. Prefill (prompt processing) speed is tracked separately because it dominates time-to-first-token (TTFT) for long contexts.
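
To make the split between decode and prefill concrete, here is a minimal sketch of how both rates could be derived from one timed run. The function names and the convention of excluding the first token from the decode window (it arrives at TTFT) are assumptions. The source does not specify how individual benchmarks compute their figures.

```python
def decode_tok_s(n_generated: int, total_s: float, ttft_s: float) -> float:
    """Decode throughput: tokens produced after the first one, per second.

    Assumes the first output token arrives at `ttft_s`, so the remaining
    `n_generated - 1` tokens are produced in `total_s - ttft_s` seconds.
    """
    decode_s = total_s - ttft_s
    if n_generated < 2 or decode_s <= 0:
        raise ValueError("need at least 2 tokens and a positive decode window")
    return (n_generated - 1) / decode_s


def prefill_tok_s(n_prompt: int, ttft_s: float) -> float:
    """Prefill throughput, approximating prompt-processing time by TTFT."""
    if n_prompt < 1 or ttft_s <= 0:
        raise ValueError("need a non-empty prompt and a positive TTFT")
    return n_prompt / ttft_s


# Hypothetical run: 2000-token prompt, 257 output tokens,
# first token after 2.5 s, run finished after 10.5 s.
print(decode_tok_s(257, total_s=10.5, ttft_s=2.5))  # 32.0
print(prefill_tok_s(2000, ttft_s=2.5))              # 800.0
```

With a long prompt the TTFT term grows while the decode rate stays roughly the same, which is why the two numbers are reported separately.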

Data sources

ClankerBuilder aggregates tok/s ratings from community benchmarks, lab benches, and spec-based estimates. All data comes from third-party sources; we never publish our own benchmark numbers.

  • Community benchmarks — LocalLLaMA, llama.cpp issue threads, and user-submitted runs (normalized to Q4_K_M, batch 1).
  • Lab benches — Controlled runs on reference hardware with fixed prompt/decode token counts.
  • Spec formula fallback — When no observation exists, a bandwidth-limited estimate is used and marked ESTIMATED.
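
The source does not publish the exact fallback formula. A common bandwidth-limited approximation, shown here as a sketch under that assumption, treats batch-1 decode as reading every model weight from memory once per token, so the ceiling is memory bandwidth divided by model size. The function name, the efficiency parameter, and the example figures (1008 GB/s for an RTX 4090, roughly 4.9 GB for an 8B model at Q4_K_M) are illustrative assumptions, not values taken from ClankerBuilder.

```python
def bandwidth_limited_tok_s(
    bandwidth_gb_s: float,
    model_size_gb: float,
    efficiency: float = 1.0,
) -> float:
    """Upper-bound decode rate if each token streams all weights once.

    `efficiency` (0 < efficiency <= 1) is a hypothetical derating factor
    for real-world overhead; 1.0 gives the theoretical ceiling.
    """
    if bandwidth_gb_s <= 0 or model_size_gb <= 0:
        raise ValueError("bandwidth and model size must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return efficiency * bandwidth_gb_s / model_size_gb


# Illustrative inputs only (see the assumptions in the lead-in).
ceiling = bandwidth_limited_tok_s(1008.0, 4.9)
print(f"{ceiling:.0f} tok/s theoretical ceiling")
```

Measured throughput usually lands below this ceiling, since KV-cache reads, kernel launch overhead, and sampling are not modeled. That gap is one reason such values are marked ESTIMATED.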

Performance benchmarks

For interactive LLM use, 20-30 tok/s feels responsive, while production workloads generally call for 50 tok/s or more. High-end consumer GPUs such as the RTX 4090 typically reach 60-80 tok/s or more on smaller models such as Llama 3.1 8B. Larger models (70B parameters) may run at only 5-15 tok/s on consumer hardware and usually need a multi-GPU setup to go faster.
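
As a rough illustration of what these thresholds mean in practice, the sketch below maps a tok/s figure to a tier and converts it into wall-clock time for a reply. The tier names and the exact cut-offs (20 and 50 tok/s) are assumptions drawn from the ranges quoted above, not an official ClankerBuilder classification.

```python
def throughput_tier(tok_s: float) -> str:
    """Map a decode rate to a rough usability tier (assumed cut-offs)."""
    if tok_s < 0:
        raise ValueError("tok/s cannot be negative")
    if tok_s >= 50:
        return "production"
    if tok_s >= 20:
        return "interactive"
    return "below interactive"


def generation_time_s(n_tokens: int, tok_s: float) -> float:
    """Seconds needed to decode `n_tokens` at a steady `tok_s`."""
    if tok_s <= 0:
        raise ValueError("tok/s must be positive")
    return n_tokens / tok_s


# A hypothetical 500-token reply at three different decode rates.
for rate in (10.0, 25.0, 70.0):
    print(rate, throughput_tier(rate), f"{generation_time_s(500, rate):.1f} s")
```

At 10 tok/s a 500-token reply takes 50 seconds, while at 25 tok/s it takes 20 seconds.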

Disclaimer

Tok/s ratings are indicative estimates. Actual performance varies with driver versions, runtime builds, context length, concurrent workloads, and thermal limits. Always validate on your own hardware before production deployment.