Skip to main content
AI engineering glossary

What is Tool Use (in AI / LLMs)?

Tool use (also called function calling) is the capability of an LLM to invoke external functions, APIs, or other systems during inference: allowing the model to retrieve information it doesn't know, perform actions in the real world, or compute results that would be unreliable to generate from training data alone (math, code execution, database queries).

Last updated 2026-04-28BearPlex AI Engineering Team

Overview

Tool use is what transforms an LLM from a text predictor into something genuinely useful for production systems. A frontier model with no tools knows whatever was in its training data and can't act on the world. Add tools (search, retrieval, computation, API calls) and the model becomes capable of grounded answers, real-world actions, and reliable computation. Tool use was first formalized in OpenAI's function calling API (mid-2023), then standardized across providers (Anthropic tools, Google function calling). Modern LLMs handle tool use natively: the developer declares available tools with structured schemas, the model decides when to invoke them, the framework executes the call, and results flow back into the model's context for synthesis. This pattern is the foundation of every modern AI agent.

How tool use works

Three steps. (1) Tool declaration: the developer provides a schema describing each available tool, name, description, parameters, expected return shape. The schema is included in the model's system context. (2) Model decides: when generating a response, the model can either produce text or emit a tool call (a structured request to invoke a specific tool with specific arguments). The decision is part of the model's normal generation. (3) Framework executes and feeds back: the runtime executes the tool, captures the result, and feeds it back to the model as a 'tool result' message. The model then continues generating, now with the tool's output in context.

What makes tool use reliable

Three engineering disciplines. (1) Tool design: clear descriptions, well-typed parameters, consistent error handling. Vague tool descriptions are the #1 cause of agent failures. (2) Argument validation: never trust LLM-generated arguments, validate types, ranges, and business rules before executing. (3) Result formatting: tool outputs need to be LLM-friendly (structured but readable, with explicit schema). Returning raw API responses often leaves models confused; transforming responses into clear summaries dramatically improves performance.

Common tool patterns

Search tools (web search, knowledge base retrieval). Computation tools (math, code execution via Python interpreter, formula evaluation). Data access tools (database queries, API calls to internal systems). Action tools (send email, create ticket, write file, call API endpoint). Each pattern has different reliability characteristics: read-only tools are safe, action tools require human checkpoints for consequential operations, computation tools need result validation.

Use cases

  • Grounding agents in real-time data via web search or API calls
  • Computing reliable answers (math, code execution) instead of relying on model arithmetic
  • Allowing agents to update systems (CRM, project management, financial systems)
  • Document and image processing via specialized tools
  • Custom enterprise integrations (Salesforce, Snowflake, internal APIs)
  • Multi-modal extensions (vision, audio, structured data) via tool calls

Examples in production

OpenAI function calling

OpenAI introduced function calling in mid-2023 and remains the most-deployed tool-use implementation. Supports parallel function calls and structured outputs.

Source

Anthropic tool use

Anthropic's tool use API for Claude: supports structured tool definitions, parallel tool calls, and integrates with computer use for desktop automation.

Source

Google Gemini function calling

Google's Gemini models support function calling with native multi-modal tool integration.

Source

Stripe function calling case study

Stripe documented their production deployment of LLM tool use for customer support agents handling refunds, account changes, and policy questions.

Source

Tool Use compared to alternatives

AlternativeChoose Tool Use whenChoose alternative when
RAG (Retrieval Augmented Generation)
Pre-retrieve documents based on the query and inject into context
Tool use when the model decides at generation time which information to retrieve, supports multiple types of action, or needs to act on the world.RAG (without tool use) when retrieval is the only action needed and you want predictable single-shot retrieval before generation.
Pure prompt-based reasoning
Asking the model to compute or reason without external tools
Tool use for math, code execution, current data, or any task where reliable computation matters.Pure prompt-based for tasks the model can do internally well: text generation, classification, simple reasoning.

Common pitfalls

  • Vague tool descriptions: models call the wrong tool because the description is ambiguous. Tool descriptions are product copy: write them well.
  • No argument validation: trust LLM-generated arguments at your peril. Validate types, ranges, and business rules before executing.
  • Action tools without human checkpoints: tools that send emails, modify databases, or transfer money should require human approval or have undo paths.
  • Returning raw API responses: dumping JSON back to the model often confuses it. Transform results into LLM-friendly summaries.
  • Tool result tokens: rich tool results consume context. For high-volume systems, summarize tool outputs to control cost.
FAQ

Questions about Tool Use.

Effectively yes, with minor terminology differences across providers. OpenAI uses 'function calling.' Anthropic uses 'tool use.' Both refer to the same underlying capability: structured LLM invocation of external functions with typed parameters and structured responses. Use either term: your audience will know what you mean.

Frontier models handle 20-50 tools effectively. Beyond that, model performance degrades because the tool selection problem grows. For systems with many tools, common patterns include: tool routing (a small model picks which tools to expose to the main model), hierarchical tool organization, and on-demand tool discovery.

Raw API tool calling is fine for simple cases (1-5 tools, single-shot or shallow agent loops). Frameworks (LangGraph, Anthropic's agent SDK, custom orchestrators) add value when you need state management, parallel execution, error recovery, observability, or human-in-the-loop checkpoints. Most production systems benefit from framework adoption past a certain complexity.

Three layers: (1) clear, specific tool descriptions that disambiguate similar tools; (2) Pydantic or JSON Schema validation of arguments before execution; (3) detailed error messages back to the model so it can correct mistakes. With these in place, frontier models call tools correctly 95%+ of the time on well-designed schemas.

Each tool call adds the LLM round-trip time (typically 500-2000ms) plus tool execution time. Modern providers support parallel tool calls (multiple invocations per round trip) which dramatically improves latency for fan-out workflows. For latency-critical systems, batch tool calls and minimize sequential dependencies.

Work with BearPlex

Need help implementing Tool Use?

BearPlex builds production AI systems that use Tool Use for Fortune 500s and high-growth scale-ups. Outcome-based pricing. 90-day embedded sprints.