Continuous Batching and KV Caching: Maximizing LLM Throughput

April 25, 2026
Imagine running a high-end restaurant where the chef only starts a new dish after every single person at a table has finished their entire meal. If one guest orders a complex 12-course tasting menu and another just wants a side of fries, the fries-eater sits there staring at an empty plate for an hour while the chef focuses on the tasting menu. That is essentially how static batching works in LLM inference, and it is a massive waste of expensive GPU power. In the world of Large Language Models, the goal is always the same: get the most tokens out of your hardware for the least amount of money. To do that, we have to move away from rigid request-level processing and toward a more fluid, token-level approach. Continuous batching is a dynamic scheduling technique that allows new requests to enter and finished requests to leave a batch during the generation process. By combining it with smart memory management, we can turn a sluggish inference pipeline into a high-throughput engine.

The Memory Tax: Understanding KV Caching

To understand why we need complex batching, we first have to look at how Transformers actually "remember" things. LLMs generate text one token at a time. For every new word produced, the model has to look back at everything that came before it to maintain context. Without a cache, the model would have to re-calculate the mathematical representations of every previous token over and over again. This creates a computational nightmare where the per-step cost grows quadratically, O(n²), as the sequence gets longer. KV Caching is a mechanism that stores the Key and Value vectors of previous tokens in GPU memory to avoid redundant attention computations. By saving these vectors, the model only needs to compute attention for the newest token, reducing the per-step cost to linear, O(n). While this sounds like a win, it introduces a "memory tax." For a model with L layers, H attention heads, and a per-head dimension d_head, each token requires roughly 2 * L * H * d_head * bytes_per_element of cache space, where the factor of 2 accounts for storing both the Key and the Value. As your conversation grows, the cache expands. If you have thousands of concurrent users, this memory pressure becomes the primary bottleneck, often leading to "Out of Memory" (OOM) errors or forcing the system to limit the number of users it can handle simultaneously.
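To make the memory tax concrete, here is a rough sizing sketch in Python based on the formula above. The layer, head, and dimension numbers are illustrative assumptions (roughly a 7B-class model with fp16 cache entries), not measurements from any particular deployment.

```python
# Rough KV cache sizing using the 2 * L * H * d_head * bytes_per_element formula.
# The example numbers approximate a 7B-class model (32 layers, 32 heads,
# head dim 128) with fp16 (2-byte) cache entries; adjust for your own model.

def kv_cache_bytes_per_token(num_layers: int, num_heads: int,
                             head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers (Key and Value)."""
    return 2 * num_layers * num_heads * head_dim * bytes_per_elem

per_token = kv_cache_bytes_per_token(num_layers=32, num_heads=32, head_dim=128)
print(f"{per_token / 1024:.0f} KiB per token")   # 512 KiB per token

# 1,000 concurrent users each holding a 2,048-token context:
total = per_token * 2048 * 1000
print(f"{total / 1e9:.0f} GB of KV cache")       # roughly 1,074 GB, i.e. terabyte-scale
```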

The Efficiency Gap: Static vs. Continuous Batching

Most early LLM deployments used static batching. The system would wait for a group of requests (say, 4 or 8), run them all through the GPU, and only start the next batch once the very last token of the longest request was finished. This is where the "restaurant problem" happens. If three requests finish in 10 tokens but one request needs 500 tokens, the slots for the three short requests are wasted for 490 generation steps. Continuous batching, often called iteration-level batching, solves this by operating at the token level. Instead of waiting for the whole batch to finish, the scheduler checks the state of every request after every single token generation. As soon as a request hits its stop token or reaches its length limit, it is evicted from the batch, and a new request from the queue is slotted in immediately.
Static Batching vs. Continuous Batching Comparison

| Feature         | Static Batching                 | Continuous Batching              |
| --------------- | ------------------------------- | -------------------------------- |
| Scheduling Unit | Request level                   | Token/iteration level            |
| GPU Utilization | Low (waits for longest request) | High (constant filling of slots) |
| Latency         | Higher for short requests       | Lower, more consistent           |
| Throughput      | Baseline                        | Typically 2x to 23x higher       |
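The scheduling difference is easier to see in code. Below is a deliberately simplified, framework-agnostic sketch of an iteration-level scheduling loop in Python; the Request class and model_step function are stand-ins invented for this example, and real engines such as vLLM fold these steps into GPU kernels with far more state tracking.

```python
import collections

# Minimal illustration of iteration-level scheduling: after *every* decode step,
# finished requests are evicted and waiting requests are pulled into the batch.

class Request:
    def __init__(self, req_id, max_new_tokens):
        self.req_id = req_id
        self.max_new_tokens = max_new_tokens
        self.generated = 0
        self.finished = False

def model_step(batch):
    """Pretend to decode one token for every request in the batch."""
    for req in batch:
        req.generated += 1
        # A real engine would also stop on an end-of-sequence token.
        if req.generated >= req.max_new_tokens:
            req.finished = True

def serve(waiting, max_batch_size=4):
    running = []
    while waiting or running:
        # Fill any free slots immediately instead of waiting for the batch to drain.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        model_step(running)
        # Evict finished requests at token granularity.
        done = [r for r in running if r.finished]
        running = [r for r in running if not r.finished]
        for r in done:
            print(f"request {r.req_id} finished after {r.generated} tokens")

queue = collections.deque(
    [Request("short-1", 10), Request("short-2", 10),
     Request("short-3", 10), Request("long-1", 500), Request("short-4", 12)]
)
serve(queue)
```

Notice that the three short requests free their slots after 10 steps, so "short-4" starts immediately rather than waiting for the 500-token request to finish.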

Solving Memory Fragmentation with PagedAttention

Even with continuous batching, we hit a wall: memory fragmentation. Traditionally, systems reserved a large, contiguous block of memory for the maximum possible sequence length. If you reserved space for 2,048 tokens but the user only wrote a 10-word prompt, the remaining space sat empty and unusable; this is called internal fragmentation. PagedAttention is a memory management technique that allocates the KV cache in non-contiguous, fixed-size pages, similar to virtual memory in operating systems. Instead of one giant block, PagedAttention breaks the cache into small blocks that are allocated on demand. If a sequence grows, the system simply assigns another page from the pool. This approach also allows multiple requests to share the same physical memory blocks if they have the same prefix (like a common system prompt). By eliminating the need to reserve maximum capacity upfront, PagedAttention dramatically increases the number of requests a single GPU can handle, acting as the foundational engine for systems like vLLM.
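The bookkeeping idea behind this can be sketched as a block table that maps each sequence's logical cache blocks to whichever physical blocks happen to be free. The following toy Python allocator illustrates the concept only; the block size, pool size, and free-list strategy are assumptions, not vLLM's actual implementation.

```python
# Toy block-table allocator in the spirit of PagedAttention: each sequence's
# logical KV blocks map to arbitrary free physical blocks, so no contiguous
# reservation is needed. (Illustrative only; not a real allocator.)

BLOCK_SIZE = 16                      # tokens per KV block (assumed)
NUM_PHYSICAL_BLOCKS = 8              # tiny pool for the example

free_blocks = list(range(NUM_PHYSICAL_BLOCKS))
block_tables = {}                    # seq_id -> list of physical block ids

def append_token(seq_id: str, seq_len_after: int) -> None:
    """Allocate a new physical block only when the sequence crosses a block boundary."""
    table = block_tables.setdefault(seq_id, [])
    blocks_needed = -(-seq_len_after // BLOCK_SIZE)   # ceiling division
    while len(table) < blocks_needed:
        if not free_blocks:
            raise MemoryError("KV cache pool exhausted; evict or preempt a sequence")
        table.append(free_blocks.pop())

def free_sequence(seq_id: str) -> None:
    """Return a finished sequence's blocks to the pool for immediate reuse."""
    free_blocks.extend(block_tables.pop(seq_id, []))

append_token("user-a", seq_len_after=1)    # a short prompt occupies just one block...
append_token("user-a", seq_len_after=40)   # ...and grows block by block as needed
print(block_tables["user-a"], free_blocks)
free_sequence("user-a")
print(free_blocks)                          # blocks are instantly available to others
```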

Advanced Optimizations: Chunked Prefill and Prefix Sharing

One of the biggest challenges in high-throughput systems is the "prefill" phase. When a user sends a long prompt, the model must process all those tokens at once before it can start generating. This creates a massive spike in compute that can stall the generation of tokens for other users already in the batch. To fix this, modern systems use chunked prefill. Instead of processing a 2,000-token prompt in one go, the system breaks it into smaller chunks (e.g., 512 tokens). This allows the scheduler to interleave the prefill of a new request with the decoding tokens of existing requests, smoothing out the compute load and preventing "stuttering" in the user experience. Furthermore, many production environments use a global prefix tree. If 1,000 users are all chatting with a bot that has the same 500-word "You are a helpful assistant..." system prompt, the system shouldn't calculate and store that prompt 1,000 times. By hashing the prefix and mapping it to a single set of KV blocks, the system saves gigabytes of memory and reduces the Time to First Token (TTFT).
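A short sketch can show both ideas side by side: splitting a long prompt into fixed-size prefill chunks, and keying a shared prefix by hash so its KV blocks are computed once. The chunk size, hashing scheme, and cache handles below are illustrative assumptions rather than any specific framework's implementation.

```python
import hashlib

CHUNK_SIZE = 512          # prefill chunk size in tokens (assumed; usually configurable)

def prefill_chunks(prompt_tokens: list[int]) -> list[list[int]]:
    """Split a long prompt so its prefill can be interleaved with other users' decode steps."""
    return [prompt_tokens[i:i + CHUNK_SIZE]
            for i in range(0, len(prompt_tokens), CHUNK_SIZE)]

# Prefix sharing: identical system prompts hash to the same key, so their
# KV blocks are computed and stored once instead of once per user.
prefix_cache: dict[str, str] = {}       # hash -> handle to cached KV blocks (stand-in)

def get_or_compute_prefix(system_prompt_tokens: list[int]) -> str:
    key = hashlib.sha256(str(system_prompt_tokens).encode("utf-8")).hexdigest()
    if key not in prefix_cache:
        prefix_cache[key] = f"kv-blocks-for-{key[:8]}"   # placeholder for real cache blocks
    return prefix_cache[key]

system_prompt = list(range(500))               # the shared "You are a helpful assistant..." prefix
long_user_prompt = list(range(2000))           # a 2,000-token user prompt

print(len(prefill_chunks(long_user_prompt)))   # 4 chunks of at most 512 tokens
handles = {get_or_compute_prefix(system_prompt) for _ in range(1000)}
print(len(prefix_cache), len(handles))         # 1 cached prefix serves all 1,000 users
```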

Real-World Performance Gains

These aren't just theoretical tweaks; the numbers are staggering. In benchmarks conducted by the vLLM team, continuous batching delivered 10-20x throughput improvements over traditional static methods. Other industry leaders, such as Anyscale, have reported gains as high as 23x. For a business, this is the difference between needing 20 H100 GPUs to serve a user base or needing only 2. Since GPU compute is one of the highest operational costs in AI, these optimizations directly impact the bottom line. The ability to maximize the "tokens per second per dollar" ratio is what makes large-scale LLM products commercially viable.

The Trade-offs: Memory Pressure and Latency

Nothing is free in systems engineering. While these techniques maximize throughput, they can introduce new pressures. As you increase the batch size to saturate the GPU, you consume more memory for the KV cache. If the memory fills up completely, the system may have to pause new requests or even evict existing ones to make room. When memory pressure peaks, you'll notice a decline in two key metrics (a simple way to measure both follows this list):
  • Time to First Token (TTFT): The delay between hitting "send" and seeing the first word appear. High memory contention increases this delay.
  • Tokens Per Second (TPS): The actual speed of the text streaming. If the GPU is overwhelmed by a massive batch, the generation speed for each individual user may drop.
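As a rough illustration of how these two metrics are measured from the client side, here is a small timing sketch around a streamed response; stream_tokens is a stand-in for whatever streaming interface your serving stack exposes, not a specific library's API.

```python
import time

def measure_stream(stream_tokens):
    """Measure TTFT and steady-state TPS for one streamed response.

    `stream_tokens` is any iterable that yields tokens as they arrive;
    it is a stand-in here, not a specific library's API.
    """
    start = time.perf_counter()
    first_token_time = None
    count = 0
    for _ in stream_tokens:
        now = time.perf_counter()
        if first_token_time is None:
            first_token_time = now            # everything before this is queueing + prefill
        count += 1
    end = time.perf_counter()
    ttft = first_token_time - start
    decode_time = end - first_token_time
    tps = (count - 1) / decode_time if decode_time > 0 else float("inf")
    return ttft, tps

# Example with a fake stream that yields a token every 50 ms:
def fake_stream(n=20, delay=0.05):
    for i in range(n):
        time.sleep(delay)
        yield f"tok{i}"

ttft, tps = measure_stream(fake_stream())
print(f"TTFT: {ttft * 1000:.0f} ms, TPS: {tps:.1f} tokens/s")
```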
Recent research, such as the BatchLLM paper, attempts to solve this by reordering requests. By prioritizing requests with a high decoding-to-prefill ratio, the system can better interleave work and reduce memory pressure through more aggressive KV reuse and dynamic programming on the prefix tree.

What is the main difference between static and continuous batching?

Static batching waits for all requests in a batch to finish before starting a new one, meaning the shortest requests are held hostage by the longest one. Continuous batching operates at the token level, evicting finished requests and adding new ones instantly, which keeps the GPU fully utilized at all times.

Does KV caching increase memory usage?

Yes. While KV caching drastically speeds up generation by avoiding redundant calculations, it requires storing key and value vectors for every token in GPU memory. As the sequence length and batch size increase, the memory demand grows linearly, which can lead to memory exhaustion if not managed with techniques like PagedAttention.

How does PagedAttention reduce fragmentation?

PagedAttention treats GPU memory like virtual memory in an OS. Instead of reserving a massive, contiguous block for the maximum possible prompt length, it allocates small, fixed-size pages on demand. This prevents "wasted" space when a user's request is shorter than the pre-allocated maximum.

What is chunked prefill and why is it useful?

Chunked prefill breaks long initial prompts into smaller pieces. This prevents a single massive prompt from hogging the GPU and pausing the token generation for other users in the batch, leading to a smoother and more consistent user experience.

Which libraries implement these techniques?

Many of these optimizations are found in open-source serving frameworks like vLLM and NVIDIA's TensorRT-LLM. These libraries implement continuous batching, PagedAttention, and prefix caching to help developers maximize their hardware efficiency.

Next Steps for Implementation

If you are moving from a prototype to a production environment, your choice of serving infrastructure is critical. For those managing their own hardware, deploying a framework like vLLM is the fastest way to get continuous batching and PagedAttention without writing the CUDA kernels yourself. If you are using a managed provider, ask about their batching strategy. Specifically, check whether they support prefix caching for long system prompts; this can drastically reduce your costs and latency if your application uses a consistent set of instructions across all users. For teams facing extreme memory pressure, consider experimenting with quantization (such as FP8 or INT8), which reduces the size of the KV cache entries and allows for even larger batches on the same hardware.
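As a concrete starting point, here is a minimal vLLM-style offline example. Treat it as a sketch: flags such as enable_prefix_caching and kv_cache_dtype exist in recent vLLM releases, but names and defaults can change between versions, so verify against the documentation for the version you install; the model name is only a placeholder.

```python
# Minimal vLLM sketch: continuous batching and PagedAttention are on by default;
# prefix caching and FP8 KV cache are opt-in. Flag names and defaults can change
# between releases, so check your installed version's documentation.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model; any local or HF path works
    gpu_memory_utilization=0.90,               # fraction of VRAM the engine may use (incl. KV cache)
    max_num_seqs=256,                          # upper bound on concurrently running requests
    enable_prefix_caching=True,                # reuse KV blocks for shared system prompts
    # kv_cache_dtype="fp8",                    # optional: shrink the cache to fit larger batches
)

system_prompt = "You are a helpful assistant. Answer concisely.\n\n"
prompts = [system_prompt + q for q in [
    "Explain continuous batching in one sentence.",
    "Why does the KV cache grow with sequence length?",
]]

params = SamplingParams(temperature=0.7, max_tokens=128)
for output in llm.generate(prompts, params):
    print(output.outputs[0].text.strip())
```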