LLM Inference Infrastructure: The Hidden Economics Behind AI Token Pricing
Core Insights
- Inference now dominates over training in determining LLM economics and model architecture design
- Two critical factors shape inference costs: computation time (T_compute) and memory access time (T_mem), with whichever takes longer determining total latency
- Batch size optimization directly correlates with sparsity ratios—denser expert models require smaller batches, while sparse models enable serving 2,400+ users per 20ms cycle
- The 300x FLOPs-to-bandwidth ratio remains constant across hardware generations, explaining why inference latency stabilizes around 20-30 milliseconds
- Context length pricing tiers (like the 200k token boundary) emerge directly from hardware bottlenecks, not arbitrary business decisions
- Modern serving infrastructure (vLLM, SGLang) represents the true competitive moat for frontier AI labs, not model training capability
Understanding Transformer Architecture and Inference Fundamentals
The transformer architecture forms the foundation of all modern large language models, but understanding how inference actually works requires grasping several interconnected concepts that most observers miss. When you paste code into Claude Code or ask ChatGPT a question, you're not triggering a single monolithic computation. Instead, the system executes two distinct phases: prefill and decode.
During prefill, your entire input, whether it's a 10-word question or a 50,000-token codebase, passes through the transformer in parallel. This is computationally efficient because every token in your input can be processed simultaneously. For every token, the transformer calculates QKV (Query, Key, Value) representations. The Query acts like a question each token asks about its context, while the Key and Value pairs function as a searchable knowledge base that grows with each position. This is where the critical KV cache gets created: the stored Keys and Values of everything the model has seen so far.
What makes the KV cache so important? Consider a 1,000-token input. The transformer produces one Key-Value pair per token per layer, so a model with 61 layers (like DeepSeek V3) caches 61,000 KV pairs from prefill alone. Each Key and Value is a vector whose width scales with the model's hidden dimension, which runs to thousands of values (7,168 for DeepSeek V3), so every pair consumes significant memory. This cached information is what enables the decode phase.
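To make the memory cost concrete, here is a minimal sketch that counts cached KV pairs and estimates their size. Only the 1,000 tokens and 61 layers come from the text; the per-token vector width `kv_dim` and the one byte per value are illustrative assumptions, not the figures of any specific model.

```python
def kv_pairs(tokens, layers):
    """One Key-Value pair per token per layer."""
    return tokens * layers


def kv_cache_bytes(tokens, layers, kv_dim, bytes_per_value=1):
    """Each pair holds a Key vector and a Value vector (hence the factor 2)."""
    return 2 * kv_pairs(tokens, layers) * kv_dim * bytes_per_value


# kv_dim=1024 is an assumed width chosen only for illustration.
print(kv_pairs(1_000, 61))                            # 61000
print(kv_cache_bytes(1_000, 61, kv_dim=1024) / 1e6)   # 124.928 (MB)
```

Under these assumptions a single 1,000-token prompt already occupies over a hundred megabytes, and the figure grows linearly with context length.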
After prefill completes, the model generates one new token at a time, a process called autoregressive generation. For each new token, the system only needs to compute Query, Key, and Value representations for that single token, append the new Key and Value to the cache, and match the Query against the entire cached Key-Value history. This is why decode is so different from prefill: you're computing attention over massive cached data using minimal new input. The efficiency comes from reusing the KV cache, avoiding recalculation of everything that came before.
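The prefill/decode split can be sketched with a toy single-head attention in plain Python. Everything here is a simplification: `project` is a hypothetical stand-in for learned Q/K/V projections, prefill is written as a loop although real systems process the prompt in parallel, and there is only one layer. The point is that a decode step using the cache gives the same result as reprocessing the whole sequence.

```python
import math


def attend(query, keys, values):
    """Softmax-weighted average of the values, scored by query-key dot products."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


def project(token_vec):
    """Hypothetical fixed transforms standing in for learned Q, K, V projections."""
    q = [x * 0.5 for x in token_vec]
    k = [x * 2.0 for x in token_vec]
    v = [x + 1.0 for x in token_vec]
    return q, k, v


class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)


def decode_step(token_vec, cache):
    """Compute Q, K, V for ONE token, extend the cache, attend over all of it."""
    q, k, v = project(token_vec)
    cache.append(k, v)
    return attend(q, cache.keys, cache.values)


def prefill(prompt, cache):
    """Fill the cache for every prompt token; return the last position's output."""
    out = None
    for token_vec in prompt:
        out = decode_step(token_vec, cache)
    return out


prompt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_token = [0.5, -0.5]

cache = KVCache()
prefill(prompt, cache)
with_cache = decode_step(new_token, cache)        # reuses 3 cached pairs

from_scratch = prefill(prompt + [new_token], KVCache())  # recomputes everything
print(all(abs(a - b) < 1e-12 for a, b in zip(with_cache, from_scratch)))  # True
```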
This architectural distinction between prefill and decode creates fundamental infrastructure challenges. During prefill, computation dominates: you're doing heavy mathematical operations on your entire input at once. During decode, memory access dominates: you're reading enormous cached tensors out of GPU memory to compare against a single query token. These are opposite optimization problems, which is why modern LLM serving requires sophisticated scheduling to handle both simultaneously.
The transformer itself consists of stacked identical blocks: DeepSeek V3 has 61, and GPT-3 has 96. Each block contains two main components: an attention mechanism that determines which information is relevant, and a multi-layer perceptron (MLP) or mixture-of-experts (MoE) block that processes that information. The attention mechanism creates interdependencies between tokens, allowing the model to build contextual understanding. The MLP/MoE block applies expert knowledge, with modern sparse models only activating a fraction of available experts per token.
The Mathematics of Inference Time: Compute vs. Memory Bottlenecks
Modern LLM inference performance can be rigorously analyzed through two competing time constraints that reveal why current AI pricing exists. The total time T required to process a batch is determined by whichever takes longer: the time required for computation (T_compute) or the time required to access memory (T_mem). This fundamental principle explains everything from why input tokens cost less than output tokens to why some providers tier pricing at 200,000 tokens.
T_compute represents the time needed to perform all mathematical operations. In inference, this scales linearly with batch size and the number of active model parameters:
T_compute = (Batch × N_active) / FLOPs_per_second
Here, "active parameters" refers to which experts are actually used. With mixture-of-experts models, you might have 384 total experts but only activate 6 for a given token. This sparsity is crucial—it allows models to scale to enormous parameter counts while keeping compute manageable. A 5-trillion parameter model with 1/8 sparsity effectively computes only 625 billion parameters per token. This is why recent models suddenly jumped to 5T+ parameters: the underlying hardware now supports handling such sparse computation efficiently.
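As a quick check of the arithmetic, the sketch below evaluates the T_compute formula for the 5-trillion-parameter, 1/8-sparsity case. The throughput value `1e17` is a placeholder chosen for illustration, not the specification of any real GPU or rack, and the formula follows the text in counting one operation per active parameter.

```python
def active_params(n_total, active_fraction):
    """Parameters actually exercised per token in a sparse (MoE) model."""
    return n_total * active_fraction


def t_compute(batch, n_active, ops_per_second):
    """T_compute = (Batch x N_active) / FLOPs_per_second."""
    return batch * n_active / ops_per_second


n_active = active_params(5e12, 1 / 8)
print(n_active)                                   # 625000000000.0

# ops_per_second=1e17 is an assumed placeholder throughput.
print(t_compute(2_400, n_active, 1e17))           # 0.015 (seconds)
```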
T_mem represents the time needed to read the model weights and the KV cache out of GPU memory (HBM) for computation:
T_mem = (N_total + Batch × Context_length × KV_bytes_per_token) / Memory_bandwidth_bytes_per_second
This formula reveals critical truths. N_total is the full model size in bytes: you must read all weights regardless of sparsity, because across a large batch nearly every expert is needed by some token. KV_bytes_per_token is the cache footprint of one token summed over all layers. On top of the weights, you must read the KV cache for every token in the context of every user in the batch. A single batch of 2,400 users in decode phase (one new token per user) requires reading 2,400 separate KV caches. When contexts are long (some users at 1,000 tokens, others at 100,000), the KV term grows until memory bandwidth becomes the bottleneck.
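A small sketch shows how context length moves T_mem. To keep the units consistent it assumes the weights are given in bytes and that each cached token occupies `kv_bytes_per_token` bytes; that parameter, the bandwidth, and the other numeric values are placeholders introduced here, not measured figures.

```python
def t_mem(weight_bytes, batch, context_length, kv_bytes_per_token, bandwidth_bytes_per_s):
    """Time to read all weights plus every user's KV cache once."""
    kv_bytes = batch * context_length * kv_bytes_per_token
    return (weight_bytes + kv_bytes) / bandwidth_bytes_per_s


def step_time(t_compute_value, t_mem_value):
    """The slower of the two constraints sets the time per step."""
    return max(t_compute_value, t_mem_value)


WEIGHTS = 5e12          # 5 TB of weights (assumed: 1 byte per parameter)
BANDWIDTH = 1e15        # placeholder aggregate bandwidth in bytes/second
KV_PER_TOKEN = 50_000   # placeholder bytes of KV cache per token

short_ctx = t_mem(WEIGHTS, 2_400, 1_000, KV_PER_TOKEN, BANDWIDTH)
long_ctx = t_mem(WEIGHTS, 2_400, 100_000, KV_PER_TOKEN, BANDWIDTH)
print(round(short_ctx * 1000, 2))   # 5.12 (ms)
print(round(long_ctx * 1000, 2))    # 17.0 (ms)

# With an assumed 15 ms of compute, only the long-context batch is memory-bound.
print(step_time(0.015, short_ctx) == 0.015)     # True
print(step_time(0.015, long_ctx) == long_ctx)   # True
```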
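The intersection point where T_compute equals T_mem determines maximum efficiency. Setting the two expressions equal and neglecting the KV cache term (reasonable for short contexts) gives the batch size at which modern GPUs achieve this balance:
Batch_size = (FLOPs / Memory_bandwidth) × (N_total / N_active)
Here N_total / N_active is the inverse of the sparsity fraction, so a model that activates 1/8 of its parameters contributes a factor of 8. With current hardware maintaining a consistent ~300x ratio between FLOPs and memory bandwidth (in FP4 precision), and typical sparsity of 1/8 to 1/12, optimal batch sizes range from 2,400 to 3,600 tokens per cycle. This isn't arbitrary. It emerges directly from hardware specifications.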
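The balance point can be evaluated directly. This sketch assumes the KV cache term is ignored and that the sparsity factor enters as total parameters divided by active parameters; the bandwidth used in the consistency check is a placeholder.

```python
def balanced_batch(flops_per_byte, n_total, n_active):
    """Batch size at which T_compute equals T_mem (KV cache term ignored)."""
    return flops_per_byte * n_total / n_active


# One expert in 8, 10, or 12 active, with the 300x ratio from the text.
for inverse_sparsity in (8, 10, 12):
    print(inverse_sparsity, balanced_batch(300, inverse_sparsity, 1))
# 8 2400.0
# 10 3000.0
# 12 3600.0

# Consistency check with placeholder hardware: bandwidth 1e12 B/s, 300x that in FLOPs.
bandwidth = 1e12
flops = 300 * bandwidth
n_total, n_active = 5e12, 6.25e11
batch = balanced_batch(300, n_total, n_active)
compute_time = batch * n_active / flops
memory_time = n_total / bandwidth          # assumes 1 byte per parameter
print(abs(compute_time - memory_time) < 1e-9)   # True
```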
The 20-Millisecond Cycle: Infrastructure's Heartbeat
Understanding why inference systems operate on a 20-millisecond cycle requires examining hardware memory architecture. The high-bandwidth memory (HBM) on a modern GPU (like NVIDIA's GB300 with 288GB of HBM) is taken here to have a bandwidth of approximately 20 terabytes per second. The drain time, meaning the time required to read the entire HBM capacity once at maximum bandwidth, works out to:
Drain_time = 288GB / (20TB/s) ≈ 14.4 milliseconds
Sustained bandwidth falls short of the peak figure in practice, which pushes the real number toward 20 milliseconds.
This natural hardware rhythm becomes the inference cycle. Every 20ms, the GPU completes all pending compute operations, outputs results, and readies the next batch. Everything in modern LLM serving infrastructure is optimized around this 20ms boundary. Tasks are scheduled to complete within this window. Users' tokens are grouped into batches that process together. The orchestration layer tracks which tokens belong to which users and reconstructs KV cache pointers appropriately.
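The drain-time arithmetic is easy to verify. The sketch below uses the capacity and bandwidth figures quoted in this section as inputs; they are the article's working numbers, not independently verified specifications.

```python
def drain_time_ms(capacity_bytes, bandwidth_bytes_per_s):
    """Milliseconds needed to read the whole memory once at full bandwidth."""
    return capacity_bytes / bandwidth_bytes_per_s * 1000


print(round(drain_time_ms(288e9, 20e12), 1))   # 14.4
```

Halving the bandwidth in the call doubles the result, which is why the cycle length is so sensitive to the memory system.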
This explains why frontier labs like Anthropic and OpenAI publish token processing figures: these indicate how many tokens they complete in each 20ms cycle and therefore the size of their GPU infrastructure. If Google announces processing 10 million tokens per second, that's 10 million × 0.02 = 200,000 tokens per 20ms cycle. Divide by the tokens a single rack handles per cycle, and you can estimate their total rack count.
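A back-of-the-envelope version of this estimate is sketched below. The figure of 2,400 tokens per rack per cycle is an assumption borrowed from the batch-size discussion, and the whole calculation presumes every rack completes exactly one batch per 20ms cycle.

```python
import math


def tokens_per_cycle(tokens_per_second, cycle_ms=20):
    """Tokens completed in one cycle at a given sustained throughput."""
    return tokens_per_second * cycle_ms / 1000


def estimate_racks(tokens_per_second, tokens_per_rack_cycle, cycle_ms=20):
    """Racks needed if each rack finishes tokens_per_rack_cycle tokens per cycle."""
    return math.ceil(tokens_per_cycle(tokens_per_second, cycle_ms) / tokens_per_rack_cycle)


print(tokens_per_cycle(10_000_000))        # 200000.0
print(estimate_racks(10_000_000, 2_400))   # 84
```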
Token Economics: Reverse Engineering API Pricing from Hardware Constraints
The pricing you see in LLM APIs—different costs for input versus output tokens, tier breaks at specific context lengths—isn't arbitrary marketing. These prices reflect deeply calculated hardware economics that frontier labs prefer not to advertise explicitly. However, the mathematics are entirely deducible from public information.
Input tokens cost less than output tokens because of prefill efficiency. During prefill, your entire input processes in parallel, and many users' prefill operations can be batched together within a single computation cycle. Output tokens require sequential generation in the decode phase, where every new token forces another pass over the model weights and the user's KV cache. A user generating 1,000 output tokens consumes 1,000 separate decode cycles, each taking 20ms minimum, for at least 20 seconds in total. The same user providing 1,000 input tokens consumes perhaps 1-10 cycles depending on batching.
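The asymmetry between input and output tokens can be put in numbers. In this sketch `tokens_per_cycle` is a hypothetical share of each cycle's token budget granted to one user's prefill; the 20ms cycle is the figure used throughout this article.

```python
import math

CYCLE_MS = 20


def decode_time_seconds(output_tokens):
    """Each output token needs its own cycle, so time grows linearly."""
    return output_tokens * CYCLE_MS / 1000


def prefill_cycles(input_tokens, tokens_per_cycle):
    """Input tokens are processed in parallel, many per cycle."""
    return math.ceil(input_tokens / tokens_per_cycle)


print(decode_time_seconds(1_000))      # 20.0
print(prefill_cycles(1_000, 1_000))    # 1
print(prefill_cycles(1_000, 100))      # 10
```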
Context length pricing tiers emerge from hardware saturation. Below a certain context threshold (typically around 200,000 tokens based on current hardware), additional context doesn't significantly impact serving efficiency. The KV cache fits comfortably in GPU memory, and prefetch operations align efficiently with compute cycles. However, beyond this threshold, context length begins consuming memory bandwidth at a rate that outpaces computation capability. The system becomes memory-bound rather than compute-bound.
When memory becomes the bottleneck, serving additional users with long context requires reducing batch size—exactly as our earlier formula predicted. Fewer users served per 20ms cycle means spreading fixed hardware costs across fewer token generations. From a provider's perspective, the cost per token rises sharply beyond 200k context, justifying higher API pricing. Some frontier labs charge 1.5x or even 2x more for context beyond this threshold.
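The effect of a shrinking batch on unit cost can be sketched as follows. The hourly rack cost of 100 is an invented placeholder in arbitrary currency units, and the model assumes one output token per user per cycle.

```python
def cost_per_token(rack_cost_per_hour, users_per_cycle, cycle_ms=20):
    """Fixed hourly hardware cost spread over the tokens produced in that hour."""
    cycles_per_hour = 3_600_000 // cycle_ms
    return rack_cost_per_hour / (cycles_per_hour * users_per_cycle)


short_context = cost_per_token(100.0, 2_400)   # full batch
long_context = cost_per_token(100.0, 1_200)    # batch halved by long contexts
print(round(long_context / short_context, 6))  # 2.0
```

Under these assumptions, halving the batch doubles the cost per token, which is the same order as the 1.5x to 2x surcharges mentioned above.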
Five-minute and one-hour cache pricing reflects data lifecycle economics. When you pause using Claude Code for five minutes, your KV cache doesn't immediately disappear. The GPU farm keeps it in high-bandwidth memory (HBM) to enable instant resumption. This is profitable because a cache already in HBM can be reused with essentially no reload delay, and the memory freed by evicting it would rarely be fully used by other users' prefill requests within five minutes. However, keeping the cache in HBM for an hour becomes expensive, since the same memory could serve multiple different users in that time. So the system "demotes" old cache to DRAM (the CPU's main memory), then to SSD, then to traditional storage. Each demotion tier costs progressively less but takes longer to rehydrate on demand.
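A tiering policy of this kind could be sketched as a simple lookup on idle time. Only the five-minute HBM window comes from the text; the one-hour and one-day thresholds below are assumptions chosen for illustration, and real systems would also weigh memory pressure and cache size.

```python
import bisect

TIERS = ("HBM", "DRAM", "SSD", "storage")
# Idle time (seconds) after which a cache leaves each of the first three tiers.
DEMOTE_AFTER_S = (300, 3_600, 86_400)   # 5 min from the text; the rest assumed


def tier_for_idle(idle_seconds):
    """Return the tier a KV cache would occupy after being idle this long."""
    return TIERS[bisect.bisect_left(DEMOTE_AFTER_S, idle_seconds)]


print(tier_for_idle(60))       # HBM
print(tier_for_idle(1_800))    # DRAM
print(tier_for_idle(7_200))    # SSD
```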
A frontier lab's profit margin depends entirely on maximizing the utilization of their expensive infrastructure. Every idle computation cycle, every underutilized GPU, and every wasted memory bandwidth represents direct financial loss. This is why the engineering that enables packing 2,400+ users' tokens into each 20ms cycle matters more than any individual algorithm or model architecture innovation.
Modern Serving Infrastructure: The True Competitive Advantage
While model training captures public attention, the actual competitive moat for frontier labs increasingly lies in serving infrastructure. Companies like Anthropic and OpenAI invest enormous resources in serving systems of the kind exemplified by vLLM and SGLang: frameworks that coordinate millions of tokens flowing through thousands of GPUs simultaneously.
The core challenge these frameworks solve is reconciling the theoretical optimal batch size with real-world user diversity. The formula suggests 2,400 tokens per cycle is optimal. But your users aren't homogeneous. Some paste "hello world" into Claude. Others throw entire codebases (100k+ tokens) into Codex. Some are in active conversation (decode only). Others are prefilling massive contexts. Some have been inactive for hours (cache evicted). Some arrived 30 seconds ago (cache hot in HBM).
PagedAttention, vLLM's foundational innovation, solved this by treating KV cache like virtual memory. Instead of requiring each user's cache to occupy contiguous GPU memory (with padding for short contexts, waste for misaligned boundaries), PagedAttention divides cache into fixed-size pages. A pointer system tracks where each user's pages live. During computation, the system efficiently gathers scattered pages from memory, performs vectorized operations treating them as one contiguous block, and writes results back. This reduces memory waste dramatically while enabling denser packing of heterogeneous workloads.
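The page-table idea can be illustrated with a toy allocator. This is a sketch of the concept only: the class and method names are hypothetical, it is not vLLM's API, and it tracks page ownership without storing any actual tensors.

```python
class PagedKVCache:
    """Toy page table mapping each user's logical token positions to physical pages."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.page_table = {}   # user_id -> list of physical page ids
        self.lengths = {}      # user_id -> number of tokens stored

    def append_tokens(self, user_id, n_tokens):
        """Grow a user's cache, claiming new pages only when the last one is full."""
        new_length = self.lengths.get(user_id, 0) + n_tokens
        pages = self.page_table.get(user_id, [])
        pages_needed = -(-new_length // self.page_size) - len(pages)  # ceiling division
        if pages_needed > len(self.free_pages):
            raise MemoryError("no free KV pages; a real server would evict or queue")
        pages = pages + [self.free_pages.pop() for _ in range(pages_needed)]
        self.page_table[user_id] = pages
        self.lengths[user_id] = new_length

    def locate(self, user_id, token_index):
        """Translate a logical token position into (physical page, offset)."""
        if not 0 <= token_index < self.lengths[user_id]:
            raise IndexError("token index outside this user's cache")
        page = self.page_table[user_id][token_index // self.page_size]
        return page, token_index % self.page_size

    def release(self, user_id):
        """Return all of a user's pages to the free pool."""
        self.free_pages.extend(self.page_table.pop(user_id, []))
        self.lengths.pop(user_id, None)


cache = PagedKVCache(num_pages=8, page_size=16)
cache.append_tokens("a", 5)     # short prompt: one page
cache.append_tokens("b", 40)    # longer prompt: three pages
cache.append_tokens("a", 20)    # user "a" grows to 25 tokens: one more page
print(cache.page_table["a"])    # [7, 3]  (non-contiguous pages)
print(cache.locate("a", 20))    # (3, 4)
```

Because pages are claimed on demand, each user wastes at most one partially filled page, regardless of how different the context lengths are.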
Chunked prefill handles the challenge of users with enormous contexts. Rather than forcing the system to choose between processing one user's entire 50,000-token prefill (monopolizing GPU resources) or breaking QKV consistency, modern systems split large prefills into chunks. A 50,000-token input might be broken into fifty 1,000-token chunks, interleaved with decode tokens from other users. Each chunk generates partial KV cache. By the time prefill completes (over many 20ms cycles), cache is complete, and decode can proceed.
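The chunking step can be sketched in a few lines. The per-cycle token budget of 2,400 and the steady supply of 2,000 waiting decode tokens are assumptions for illustration; the function names are hypothetical.

```python
def split_prefill(total_tokens, chunk_size):
    """Sizes of the chunks a long prefill is broken into."""
    return [min(chunk_size, total_tokens - start)
            for start in range(0, total_tokens, chunk_size)]


def interleave(chunks, decode_tokens_waiting, token_budget):
    """One entry per cycle: a prefill chunk plus the decode tokens that still fit."""
    return [{"prefill": chunk,
             "decode": min(decode_tokens_waiting, token_budget - chunk)}
            for chunk in chunks]


chunks = split_prefill(50_000, 1_000)
print(len(chunks))                     # 50
print(split_prefill(2_500, 1_000))     # [1000, 1000, 500]

plan = interleave(chunks, decode_tokens_waiting=2_000, token_budget=2_400)
print(plan[0])                         # {'prefill': 1000, 'decode': 1400}
```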
Sophisticated scheduling coordinates all this orchestration. The kernel scheduler must continuously solve an optimization problem: given the current user queue (some decoding, some prefilling, some resuming from cache), what batch composition maximizes utilization within the next 20ms cycle? Add 2,000 decode tokens? Or 1,500 decode plus 500 of a prefill chunk? The optimal choice changes based on remaining cache status, user patience (older requests first), and available GPU memory.
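One very simple policy for this choice is to serve latency-sensitive decode tokens first and give whatever budget remains to a prefill chunk. The sketch below implements only that greedy rule, as an assumption for illustration; production schedulers weigh cache state, request age, and memory as described above.

```python
def compose_batch(decode_waiting, prefill_remaining, token_budget):
    """Greedy split of one cycle's token budget: decode first, then prefill."""
    decode = min(decode_waiting, token_budget)
    prefill = min(prefill_remaining, token_budget - decode)
    return {"decode": decode, "prefill": prefill}


print(compose_batch(1_500, 50_000, 2_000))   # {'decode': 1500, 'prefill': 500}
print(compose_batch(3_000, 50_000, 2_000))   # {'decode': 2000, 'prefill': 0}
```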
This infrastructure remains largely proprietary. While vLLM and SGLang are open-source, the closed-source optimizations each company adds (custom CUDA kernels, precise cache scheduling heuristics, hardware-specific tuning) represent genuine competitive advantages. A company that achieves 80% GPU utilization while competitors achieve 60% has cut its cost per token by 25% without any algorithm changes.
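The 25% figure follows from cost per token being inversely proportional to utilization when hardware cost is fixed. A minimal check, with utilization given in whole percent:

```python
def relative_cost_per_token(own_utilization_pct, rival_utilization_pct):
    """Own cost per token as a fraction of the rival's, for equal hardware cost."""
    return rival_utilization_pct / own_utilization_pct


ratio = relative_cost_per_token(80, 60)
print(ratio)        # 0.75
print(1 - ratio)    # 0.25
```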
Hardware Architecture and Its Impact on Model Design
The hardware constraints discussed in serving directly influence how models are architected during training. Modern models don't just happen to have sparse experts, enormous context windows, and particular parameter counts. These design choices reflect careful optimization for the hardware that will serve them.
Consider GPU memory architecture. NVIDIA's latest Blackwell racks (GB200/GB300) connect 72 GPUs within a single rack via NVLink. With the GB300's 288GB per GPU, that creates approximately 20TB of aggregate HBM. This isn't accidental: models are sized to fit within this constraint. A 5-trillion parameter model in FP8 precision (one byte per parameter) consumes 5TB. With activation memory and KV cache, the math becomes: 5TB (weights) + up to 10-15TB (KV cache for typical serving loads) ≈ 20TB total. This is why we suddenly see 5T models appearing as NVL72 rolls out: the hardware was designed to serve models of this scale.
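The rack-level memory budget can be tallied as follows. The inputs are the figures used in this article (72 GPUs, 288GB each, 5 trillion parameters at one byte apiece); what remains after the weights is an upper bound on space for KV cache and activations.

```python
def rack_hbm_tb(gpus, hbm_gb_per_gpu):
    """Aggregate HBM across a rack, in terabytes."""
    return gpus * hbm_gb_per_gpu / 1000


def weights_tb(n_params, bytes_per_param):
    """Memory taken by the model weights, in terabytes."""
    return n_params * bytes_per_param / 1e12


total = rack_hbm_tb(72, 288)
weights = weights_tb(5e12, 1)          # FP8: one byte per parameter
print(total)                           # 20.736
print(weights)                         # 5.0
print(round(total - weights, 3))       # 15.736 (TB left for KV cache and activations)
```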
Similarly, sparsity ratios in mixture-of-experts models are tuned to the compute-memory ratio. Reasoning models, which emit long chains of reasoning tokens, consume enormous computation. By implementing aggressive sparsity (activating only 1/12 of experts), frontier labs ensure that even with massive token volumes, memory bandwidth isn't overwhelmed. The reasoning compute happens quickly; KV cache is accessed efficiently.
Context window design reflects memory tier economics. At 200,000 tokens, models enter the hardware inflection point where additional context becomes disproportionately expensive to serve. This is why most models max out around 200k native context, with only a few pushing to 1M+ (which is more expensive to support). Some models implement context compression or retrieval augmentation to work within the economically optimal range.
The Future: Inference-Driven Model Development
The transition from training-dominated to inference-dominated AI development is perhaps the most significant structural shift happening in AI right now, yet it remains largely invisible to observers focused on model capabilities.
When frontier labs announce new models, they're implicitly announcing corresponding inference infrastructure improvements. GPT-4.5 appeared around the same time as certain GPU connectivity advances. DeepSeek V3's efficiency gains (1/3 compute, 1/10 memory compared to prior dense models) enable serving on hardware that would struggle with denser architectures. The 200k context window of Anthropic's Claude models aligns with the economically sustainable serving tier.
Looking forward, expect models to become increasingly sparse and specialized. Mixture-of-experts will become more aggressive. Reasoning tokens will proliferate (they're just KV cache entries, consuming memory but enabling better output). Retrieval augmentation will replace ever-longer contexts in many applications. All these trends reduce memory bandwidth requirements or improve compute-memory balance—directly improving serving economics.
The firms that master the hardware-software co-design of inference infrastructure will dominate. Not because their models are marginally better, but because they can serve them at half the cost of competitors. At scale, cost becomes destiny.
Conclusion
Modern large language model inference is fundamentally a problem of hardware efficiency, not algorithmic novelty. The T_compute and T_mem framework explains why APIs are priced as they are, why certain architectural choices are made, and why infrastructure engineering has become the primary competitive advantage.
Every token you generate from Claude, ChatGPT, or any frontier model runs through this infrastructure—optimized to pack your compute into that precise 20-millisecond cycle alongside thousands of other users' tokens, all operating at the edge of physical hardware limitations. The invisible machinery behind AI's remarkable affordability is not remarkable algorithms; it's remarkable engineering.
Understanding this transforms how you interpret industry announcements, evaluate AI economics, and predict which companies will thrive in an increasingly competitive AI services market. Frontier labs aren't primarily competing on model capabilities anymore—they're competing on serving infrastructure efficiency. The math is inescapable, and the winner will be whoever masters the hardware-software interface most completely.
Original source: EP 96. LLM Inference Infrastructure and Token Economics