Question 1

What is tok/s?

Accepted Answer

Tokens per second (tok/s) is the decode throughput of an LLM runtime: how fast the model generates output tokens once the prompt has been processed. It does not cover prompt processing (prefill), only the speed of text generation during inference.
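As a rough illustration of that definition, the sketch below times a token stream and starts the clock only after the first token arrives, so prompt processing is left out. The function name `measure_decode_tok_s` and the injectable `clock` parameter are assumptions made for this example; they are not part of any particular runtime.

```python
import time
from typing import Callable, Iterable


def measure_decode_tok_s(
    token_stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return decode throughput in tokens per second for a token stream.

    The first token is consumed before timing starts, because it only
    arrives after the prompt has been processed.
    """
    it = iter(token_stream)
    try:
        next(it)  # wait for the first token; not counted, not timed
    except StopIteration:
        raise ValueError("stream produced no tokens") from None

    start = clock()
    count = sum(1 for _ in it)  # tokens generated after the first one
    elapsed = clock() - start

    if count == 0 or elapsed <= 0:
        raise ValueError("not enough tokens or time to compute a rate")
    return count / elapsed
```

In practice `token_stream` would be whatever streaming iterator your runtime exposes; any iterable of tokens works for the sketch.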

Question 2

How does ClankerBuilder estimate tok/s ratings?

Accepted Answer

We aggregate benchmarks from LocalLLaMA posts, llama.cpp issue threads, controlled lab measurements, and user-submitted runs. When no observation exists for a configuration, we fall back to a memory-bandwidth-limited estimate, marked as 'ESTIMATED'.
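The exact formula is not given here, so the following is only a sketch of how such a fallback could work. It assumes the common rule of thumb that decode speed is bounded by memory bandwidth divided by the bytes read per token (roughly the model size). The `efficiency` factor, the use of a median, the 'OBSERVED' label, and all function names are assumptions made for illustration; only the 'ESTIMATED' label comes from the answer above.

```python
from statistics import median
from typing import Sequence, Tuple


def bandwidth_limited_tok_s(
    bandwidth_gb_s: float,
    model_size_gb: float,
    efficiency: float = 0.6,  # assumed fraction of peak bandwidth achieved
) -> float:
    """Upper-bound style estimate: each token reads the weights once."""
    if bandwidth_gb_s <= 0 or model_size_gb <= 0:
        raise ValueError("bandwidth and model size must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return efficiency * bandwidth_gb_s / model_size_gb


def rate_tok_s(
    observations: Sequence[float],
    bandwidth_gb_s: float,
    model_size_gb: float,
    efficiency: float = 0.6,
) -> Tuple[float, str]:
    """Prefer observed benchmarks; fall back to the bandwidth estimate."""
    if observations:
        # Median is an assumed aggregation; it resists outlier runs.
        return median(observations), "OBSERVED"
    estimate = bandwidth_limited_tok_s(bandwidth_gb_s, model_size_gb, efficiency)
    return estimate, "ESTIMATED"
```

For example, `rate_tok_s([], 1000.0, 5.0, efficiency=0.5)` returns `(100.0, "ESTIMATED")`, while any non-empty list of observations takes precedence over the estimate.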

Question 3

What's a good tok/s for local LLM inference?

Accepted Answer

For interactive use, 20-30 tok/s feels responsive. For production workloads, 50 tok/s or more is preferred. High-end GPUs such as the RTX 4090 reach 60-80 tok/s or more on smaller models such as Llama 3.1 8B.
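To put these figures in terms of waiting time, the small sketch below converts a decode rate into the time needed to generate a response. The function name and the 500-token response length are illustrative assumptions, and prompt processing time is ignored.

```python
def generation_seconds(output_tokens: int, tok_s: float) -> float:
    """Seconds needed to generate `output_tokens` at a steady decode rate."""
    if output_tokens < 0:
        raise ValueError("output_tokens must be non-negative")
    if tok_s <= 0:
        raise ValueError("tok_s must be positive")
    return output_tokens / tok_s


# A hypothetical 500-token answer at different decode rates.
for rate in (25, 50):
    print(f"{rate} tok/s: {generation_seconds(500, rate):.0f} s")
```

At 25 tok/s the 500-token answer takes 20 seconds; at 50 tok/s it takes 10.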

Question 4

How accurate are the tok/s estimates?

Accepted Answer

Tok/s ratings are indicative estimates. Actual performance varies with driver versions, runtime builds, context length, concurrent workloads, and thermal limits. Always validate on your target hardware.

Question 5

Where do GPU prices come from?

Accepted Answer

Prices are sourced via the sovrn aggregator adapter. Offers older than 48 hours are marked as 'STALE' until the next refresh.
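The 48-hour rule can be sketched as a simple timestamp comparison. This is a minimal illustration, not the site's implementation: the function name, the 'FRESH' label, and the choice to treat an offer of exactly 48 hours as not yet stale are all assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

STALE_AFTER = timedelta(hours=48)  # threshold stated in the answer above


def offer_status(fetched_at: datetime, now: Optional[datetime] = None) -> str:
    """Return 'STALE' for offers older than 48 hours, else 'FRESH'.

    Both datetimes are expected to be timezone-aware.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return "STALE" if now - fetched_at > STALE_AFTER else "FRESH"
```

Passing `now` explicitly keeps the check deterministic, which makes the rule easy to test.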
