A language model is a brain in a sealed jar. It can reason, write, and argue beautifully about a world it cannot touch. Ask it today's weather and it will either refuse or, worse, invent something plausible. Ask it to send an email and it can only describe sending one. Its knowledge stops at a training cutoff and its arms don't reach past the prompt.
Tool use cuts a hole in the jar. You hand the model a menu of functions it's allowed to call — fetch a price, query a database, run some code, book a meeting — and now the brain can act. Everything before this pattern was about orchestrating thinking. This is the one that lets the agent do.
Who decides, who executes
The flow has a clean division of labor, and getting it straight kills a lot of confusion. The model never runs anything. It requests. You run things.
- You describe the available tools to the model.
- Given the request, the model decides whether it needs one — and if so, emits a structured object: which tool, what arguments.
- Your code intercepts that, validates the arguments, and executes the real function.
- The result goes back to the model.
- The model uses it to answer, or to decide on the next call.
That loop repeats until the model stops asking for tools. The model is the dispatcher; your harness is the hands. It proposes; you dispose.
A tool is three things, and one of them matters most
A tool definition is a name, a description, and a parameter schema. Here's the whole loop with one tool, period-correct:
import json
from anthropic import Anthropic
client = Anthropic()
tools = [{
"name": "get_weather",
"description": "Get current weather for a city. Use this whenever the user "
"asks about current conditions, temperature, or forecast.",
"input_schema": {
"type": "object",
"properties": {
"city": {"type": "string", "description": "City name, e.g. 'London'"}
},
"required": ["city"],
},
}]
def run(user_msg):
messages = [{"role": "user", "content": user_msg}]
while True:
resp = client.messages.create(
model="claude-sonnet-4-5", max_tokens=1024,
tools=tools, messages=messages,
)
messages.append({"role": "assistant", "content": resp.content})
if resp.stop_reason != "tool_use":
return resp
results = []
for block in resp.content:
if block.type == "tool_use":
out = WEATHER_API(**block.input) # your real function
results.append({
"type": "tool_result",
"tool_use_id": block.id,
"content": json.dumps(out),
})
messages.append({"role": "user", "content": results})
Stare at the description field, because that string is doing more work than the rest of the file combined.
The description is a prompt, so write it like one
The single biggest lever on tool-calling quality isn't the model — it's how you describe the tool. A vague description ("gets weather") and the model calls it at the wrong time, or doesn't call it when it should, or fills in garbage arguments. A description that says when to use the tool, not just what it does, steers the model far more reliably.
Anthropic spelled this out in Writing effective tools for AI agents: tool specs are prompt-engineered context, and treating them as throwaway boilerplate is leaving most of your accuracy on the floor. Be prescriptive. Name the trigger conditions. Give an example argument in the parameter description. The model reads all of it, every call.
And validate before you execute. The model can hallucinate a malformed argument — a string where you wanted an enum, a missing required field. Enforce the schema (Pydantic, strict JSON), and when it fails, return a clear error back to the model so it can retry instead of crashing your harness. A good error message is itself a tool: it teaches the model what it got wrong.
Too many tools, and other ways to lose
Tool use degrades in a specific, measurable way as you pile on more tools. Past roughly ten or twenty, accuracy starts to slide — the model gets confused about which of forty near-identical functions to reach for. The Berkeley Function Calling Leaderboard, formalized in the BFCL ICML 2025 paper, exists precisely to measure this across serial, parallel, and multi-turn calls, and the rankings make the failure modes hard to wave away.
The scaling answer isn't "add more tools." It's to stop showing the model everything at once. Route to a relevant subset (the routing pattern earns its keep here), or retrieve tools the way Gorilla and ToolLLM do — pull only the handful of APIs relevant to this request out of a catalog of thousands. There's also a newer angle: Anthropic's Code execution with MCP argues agents scale better when they write code that calls tools than when they invoke tools one at a time through the model — the code becomes the orchestration layer, and the model stops being a bottleneck for every single call.
Which is a good moment to mention MCP. The Model Context Protocol is the open standard for exposing tools and data to agents so you're not hand-wiring every integration. It gets its own series later; for now, just know that "how do I connect my agent to fifty tools without losing my mind" has a protocol-shaped answer.
Guard the things that touch the world
The whole point of tool use is real effects, which is exactly why it's the most dangerous pattern in this set. A chain that summarizes a document wrong wastes some tokens. A tool call that sends the wrong email, charges the wrong card, or drops the wrong table does damage you can't take back.
Never blindly execute model-generated arguments for an action that's destructive or outbound. Gate it. Put a confirmation step in front of anything irreversible — a human approval, a dry-run, a sandboxed permission boundary. Read-only tools can run freely; tools that change something earn a checkpoint. This is where tool use hands off to the human-in-the-loop and guardrail patterns, and it's not optional polish. It's the difference between an agent that's useful and an agent that's a liability with API keys.
Give a model hands and it becomes genuinely useful for the first time. Give it hands without judging which actions need a second look, and you've automated the mistakes too.
Leave a Reply
Your email address will not be published.