Cost Control for LLM Agents: Tool Calls, Context Windows, and Think Tokens
February 18, 2026
Running LLM agents at scale isn’t just about making them smarter; it’s about keeping them affordable. In 2026, companies are seeing monthly bills for their AI agents climb past $250,000 without any real optimization. That’s not a bug. It’s a feature of how most teams build agents today: they throw the biggest model at every problem and hope for the best. But here’s the truth: you don’t need a $100,000 Ferrari to drive to the grocery store. The same logic applies to LLM agents. The real win isn’t raw power; it’s smart efficiency.
Context Windows Are Eating Your Budget
Every time an LLM agent processes a prompt, it reads every token in the context window. That includes the user’s message, past chat history, retrieved documents, tool outputs, and even your internal system notes. Most agents load everything in, no matter how irrelevant. That’s like carrying your entire closet into the bathroom just to pick out a shirt. The fix? Context pruning. Don’t just cut the fluff; cut the noise. Start by summarizing old conversation turns instead of keeping every word. If the user asked about their order status three messages ago, and you already answered it, don’t keep the full reply. Summarize: "User confirmed receipt of Order #12345." That alone can cut your context size by 30%. Use metadata tags to mark which parts are essential: "Required for decision", "Historical reference", "Irrelevant". Then train your agent to ignore the last two. One team at a logistics startup cut their average context length from 12,000 tokens to 7,200 using this method. Their monthly cost dropped by $41,000. No model change. No infrastructure upgrade. Just smarter context.
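Here’s a minimal sketch of that pruning step, assuming a simple in-memory chat history. The `Turn` structure, the tag names, and the `summarize()` placeholder are illustrative, not any particular framework’s API; in practice the summarizer would be a call to a small, cheap model.

```python
# Minimal context-pruning sketch. `summarize()` stands in for a cheap
# summarization call (e.g. a small model); the tag names are illustrative.
from dataclasses import dataclass

@dataclass
class Turn:
    role: str               # "user" / "assistant" / "tool"
    text: str
    tag: str = "required"   # "required" | "historical" | "irrelevant"

def summarize(text: str) -> str:
    # Placeholder: in practice, call a small/cheap model to compress the turn.
    return text[:120] + "..." if len(text) > 120 else text

def prune_context(history: list[Turn], keep_last: int = 4) -> list[Turn]:
    pruned = []
    for i, turn in enumerate(history):
        if turn.tag == "irrelevant":
            continue                      # drop noise entirely
        if i < len(history) - keep_last and turn.tag == "historical":
            # old, non-essential turns get compressed instead of kept verbatim
            pruned.append(Turn(turn.role, summarize(turn.text), turn.tag))
        else:
            pruned.append(turn)           # keep recent and required turns as-is
    return pruned
```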
Think Tokens Are the New Hidden Cost
Models like OpenAI’s o1 and DeepSeek R1 don’t just generate answers; they think. They run internal reasoning chains, simulate multiple approaches, and weigh options before responding. These are called "think tokens": tokens spent not on output, but on internal processing. And they’re expensive. Think tokens can double or even triple your inference cost per request. But they’re not useless. For complex tasks, like debugging code, analyzing financial reports, or planning multi-step workflows, they’re worth it. The problem? Most agents use them for everything. The solution is task-based routing. Don’t let every agent use deep reasoning. Build a traffic cop for your AI (a minimal routing sketch follows the list below):
- Simple tasks ("What’s my balance?", "Schedule a meeting") → use GPT-3.5 or Claude Haiku. No think tokens needed.
- Standard tasks ("Explain this policy", "Draft an email") → use GPT-4o-mini or Claude Sonnet. Light reasoning.
- Complex tasks ("Analyze these 10 reports and find inconsistencies") → use GPT-4 or Claude Opus. Let it think.
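A traffic cop can start this simple. The keyword lists, length thresholds, and model names below are illustrative assumptions; in production you would likely replace the rules with a small classifier model or a scoring prompt.

```python
# Rough rule-based "traffic cop" for model routing. Keyword rules, thresholds,
# and model names are examples only.
COMPLEX_HINTS = ("analyze", "debug", "inconsistencies", "plan", "compare")
SIMPLE_HINTS = ("balance", "schedule", "status", "open?", "cancel")

def pick_model(user_message: str) -> str:
    text = user_message.lower()
    if any(h in text for h in COMPLEX_HINTS) or len(text) > 500:
        return "gpt-4"            # deep reasoning, think tokens allowed
    if any(h in text for h in SIMPLE_HINTS) and len(text) < 120:
        return "gpt-3.5-turbo"    # cheap, no reasoning overhead
    return "gpt-4o-mini"          # default middle tier

print(pick_model("What's my balance?"))                                  # gpt-3.5-turbo
print(pick_model("Analyze these 10 reports and find inconsistencies"))   # gpt-4
```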
Tool Calls Are Cost Multipliers
Tool calls seem harmless. The agent asks for weather data. Gets it. Uses it. Done. But here’s what happens in practice:
- Agent calls API A → 150ms delay
- Agent reads result → adds it to context
- Agent calls API B → another 150ms
- Agent calls API C → again
- Agent re-reads all previous results → more tokens
Every call adds latency, and every result gets stuffed back into the context as more tokens. Three fixes (a caching sketch follows this list):
- Batch tools. If the agent needs weather, stock price, and calendar, get them all in one call. Don’t make three separate trips.
- Cache results. If the weather for Seattle was queried 10 minutes ago, reuse it. Don’t re-fetch. Semantic caching works here: same location, same time window = same answer.
- Eliminate unnecessary calls. Does the agent really need to check the calendar to confirm a meeting time? Or can it infer it from context? Remove the call. Save the token.
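A basic version of the caching fix is a TTL cache keyed on the tool name and its arguments. This sketch is exact-match caching only; a true semantic cache would also match paraphrased queries via embeddings. The `cached_tool_call` helper and the `TTL_SECONDS` value are illustrative.

```python
# Minimal TTL cache for tool results, keyed on tool name + arguments.
import time, json

_CACHE: dict[str, tuple[float, object]] = {}
TTL_SECONDS = 600   # e.g. weather is "fresh enough" for 10 minutes

def cached_tool_call(tool_name: str, args: dict, fetch) -> object:
    key = tool_name + ":" + json.dumps(args, sort_keys=True)
    now = time.time()
    if key in _CACHE and now - _CACHE[key][0] < TTL_SECONDS:
        return _CACHE[key][1]          # cache hit: no API call, no new tokens
    result = fetch(**args)             # cache miss: actually call the tool
    _CACHE[key] = (now, result)
    return result

# Usage (hypothetical fetcher): cached_tool_call("weather", {"city": "Seattle"}, fetch=get_weather)
```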
Prompt Engineering Isn’t Just About Clarity. It’s About Economy
You’ve heard "be clear in your prompts." But few realize how much wording affects cost. Every extra word is a token. Every "could you please," "I was wondering," and "just a quick question" adds up. Compare these:
- "Could you possibly provide me with a detailed explanation of how this function works?" → roughly 15 tokens
- "Explain this function." → 4 tokens
- "Please," "Thank you," "Could you" (unless human-facing)
- "As an AI assistant," (the model already knows)
- "In order to," "with the purpose of," "due to the fact that"
Infrastructure Hacks: Batching, Quantization, and Caching
You can optimize prompts all day, but if your infrastructure is inefficient, you’re still bleeding money.
Continuous batching (like vLLM) lets your GPU handle multiple agent requests in one pass. Instead of waiting for each request to finish, it groups them dynamically. One user saw 23x more throughput. Translation: you can run 23x more agents on the same server. Cost per request? Down 40%.
Quantization shrinks your model. Convert from FP16 to INT4, and you cut memory use by 75%. Llama 3 in INT4 retains over 95% of its full-precision accuracy while using roughly a quarter of the memory. For agents that don’t need perfect reasoning (like chatbots or form fillers), this is a no-brainer. You get 50-60% cost savings just by switching formats.
Semantic caching remembers similar queries. If ten users ask "How do I reset my password?" in different ways, cache the best response. Next time, serve it instantly. No LLM call. No cost. One support agent saw 42% of its queries hit the cache. That’s 42% of its monthly bill gone.
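For the quantization piece, one common route is loading a 4-bit model with Hugging Face transformers and bitsandbytes. The model ID and settings below are examples, and this assumes a GPU, the bitsandbytes package, and access to the model repo.

```python
# Load a model with 4-bit (NF4) quantized weights via transformers + bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"   # example model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # INT4 weights: ~75% less memory than FP16
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
```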
Model Routing: Not All Agents Need the Same Brain
This is the biggest lever most teams ignore. You don’t need GPT-4 to confirm a delivery date. You don’t need Claude Opus to answer "Is the office open?" Build a routing layer. Use a lightweight classifier (even a simple rule-based one) to decide which model handles each request (a quick cost comparison using these numbers follows the table):
| Task Type | Model | Cost per 1k Tokens | Latency | Use Case |
|---|---|---|---|---|
| Simple queries | GPT-3.5 or Claude Haiku | $0.0003 | 180ms | FAQ, confirmations, greetings |
| Standard tasks | GPT-4o-mini or Claude Sonnet | $0.0015 | 320ms | Email drafting, summaries, basic analysis |
| Complex reasoning | GPT-4 or Claude Opus | $0.015 | 750ms | Code debugging, financial analysis, multi-step planning |
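To see why routing matters, run the table’s per-1k-token prices through a back-of-the-envelope calculation. The request volume and token count below are made-up example numbers; the prices come from the table above and will drift as providers change pricing.

```python
# Back-of-the-envelope monthly cost using the per-1k-token prices from the table.
PRICE_PER_1K = {"simple": 0.0003, "standard": 0.0015, "complex": 0.015}

def monthly_cost(requests_per_day: int, avg_tokens: int, tier: str) -> float:
    # Assumes a 30-day month; avg_tokens covers the whole request (in + out).
    return requests_per_day * 30 * (avg_tokens / 1000) * PRICE_PER_1K[tier]

# Example: 50,000 requests/day at ~2,000 tokens each.
print(monthly_cost(50_000, 2_000, "complex"))   # 45000.0 -> $45,000/month on the top tier
print(monthly_cost(50_000, 2_000, "simple"))    # 900.0   -> $900/month on the cheapest tier
```

Routing even half of that traffic down a tier changes the bill by an order of magnitude, which is why the classifier in front of your models matters more than the models themselves.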
Monitoring: Know Where the Money Goes
You can’t control what you can’t measure. Most teams don’t track cost per request. They track "how many users" or "how many messages." That’s like measuring car mileage without checking fuel consumption. Start logging the following (a minimal logging sketch follows this list):
- Token input/output per request
- Tool calls per session
- Context length before/after pruning
- Cache hit rate
- Model used
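A minimal implementation is one JSON log line per request. The field names and pricing parameters below are illustrative; the point is that cost becomes a first-class, queryable number instead of a surprise on the invoice.

```python
# Emit one JSON line per request so cost can be aggregated per model, agent, or user.
import json, time

def log_request(model: str, in_tokens: int, out_tokens: int,
                tool_calls: int, cache_hit: bool,
                price_in_per_1k: float, price_out_per_1k: float) -> dict:
    entry = {
        "ts": time.time(),
        "model": model,
        "input_tokens": in_tokens,
        "output_tokens": out_tokens,
        "tool_calls": tool_calls,
        "cache_hit": cache_hit,
        "cost_usd": round(in_tokens / 1000 * price_in_per_1k
                          + out_tokens / 1000 * price_out_per_1k, 6),
    }
    print(json.dumps(entry))   # or ship to your metrics pipeline
    return entry
```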
Final Strategy: Build Cost Control Into Your Agent Design
Cost control isn’t a post-launch fix. It’s a design principle. Here’s your checklist before deploying any agent:
- Define the minimum model needed for the task.
- Prune context before every inference.
- Limit tool calls to one per decision cycle.
- Cache responses for repeated queries.
- Use continuous batching and quantization on self-hosted models.
- Route tasks to cost-appropriate models.
- Monitor every request’s cost in real time.
Are think tokens worth the extra cost?
Yes, but only for complex tasks. Think tokens improve accuracy on multi-step problems like code debugging or financial analysis. But for simple queries like "What’s the weather?" or "Book a meeting," they add cost without benefit. Use them selectively. Let lightweight models handle routine tasks.
Can I reduce costs without changing my model?
Absolutely. Most teams can cut 30-40% of their cost just by optimizing prompts, pruning context, and reducing tool calls. You don’t need a new model; you need smarter behavior. Trim filler words, summarize history, cache results, and batch tool use. These changes cost nothing to implement and deliver immediate savings.
Should I use cheaper models like GPT-3.5 for everything?
No. GPT-3.5 is fast and cheap, but it fails on complex reasoning. If your agent needs to analyze data, spot contradictions, or plan steps, it will hallucinate or give incomplete answers. The key is routing: use GPT-3.5 for simple tasks, and only switch to GPT-4 when the task demands deep reasoning. This balance gives you the best cost-to-quality ratio.
How do I know if my agent is overusing tool calls?
Track the number of tool calls per agent session. If it’s more than 2-3 on average, you’re likely overusing them. Look for patterns: Is the agent calling the same tool twice? Is it re-fetching data it already has? Use logging to see which tools are called most often. Then ask: Can this be cached? Can it be combined? Can it be removed entirely?
Is quantization safe for production agents?
Yes, for most use cases. Quantization (like INT4) shrinks the model and speeds up inference without major quality loss, especially for tasks like chat, summarization, or classification. Testing shows models like Llama 3 in INT4 retain over 95% of the performance of their full-precision versions. Avoid it only if your agent needs perfect numerical accuracy (e.g., medical diagnostics or financial modeling). For everything else, it’s a safe and powerful cost saver.
What’s the fastest way to cut my LLM agent costs?
Start with context pruning and prompt trimming. These require zero infrastructure changes and deliver immediate savings. Then implement task-based model routing. Combine those with semantic caching, and you’ll likely cut 40-50% of your costs within two weeks. The rest (continuous batching, quantization, tool batching) comes later as you scale.
Patrick Sieber
February 18, 2026 at 10:23
Smart breakdown. I’ve seen teams waste six figures on GPT-4 for chatbot FAQs. The routing table alone is worth printing and taping to every dev’s monitor. No model change needed, just discipline.
Kieran Danagher
February 19, 2026 at 20:43
Context pruning is the real MVP. I once worked with a team that kept 8000 tokens of chat history just because "it might be useful." Turned out the user asked about their order status once, then never mentioned it again. We cut it to 2000 tokens. Monthly bill dropped by $32k. No one even noticed.