AI engineering glossary

What is a Token in LLMs?

A token is the basic unit of text that an LLM processes: typically a word, a subword, or a punctuation mark. A tokenizer splits raw text into tokens and maps each one to an integer ID the model can compute on. As a rule of thumb, 1 token corresponds to roughly 0.75 English words.

Last updated 2026-04-28, BearPlex AI Engineering Team

Overview

Almost everything an LLM does is measured in tokens: context windows, output limits, pricing, latency. Understanding tokenization explains otherwise-confusing behaviors: why some words cost more than others, why non-English languages cost more per character, why code consumes more tokens than prose, and why the model occasionally splits words at unexpected points. Modern tokenizers (byte-level BPE for OpenAI's GPT models, implemented in the tiktoken library; SentencePiece for the original Llama models; Anthropic's custom tokenizer for Claude) all share the same fundamental approach but differ in vocabulary size, language coverage, and how aggressively they merge frequent subword pairs.

How tokenization works

Tokenizers convert raw text into sequences of integer token IDs, which the model's embedding layer then maps to vectors. The most common approach is Byte-Pair Encoding (BPE): start with individual characters (or bytes) as atomic tokens, then iteratively merge the most frequent adjacent pair into a new token until the vocabulary reaches a fixed size (typically 30K-200K tokens). Common words become single tokens ('the', 'and', 'AI'), while rare words are split into pieces ('Anthropic' might be one token, but a rare medical term might split into 4-5 subwords). Special tokens ([CLS], <|im_start|>, <|tool_use|>) mark structure and are reserved for the model's internal grammar.
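
The merge loop described above can be sketched in a few lines of pure Python. This is a toy illustration, not any production tokenizer: the function names `train_bpe`, `merge_pair`, and `encode` are hypothetical, the corpus is a plain list of words, and ties between equally frequent pairs are broken arbitrarily (here, by lexicographic order).

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each left-to-right occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Every word starts as a tuple of single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # nothing left to merge
        # Most frequent pair wins; ties are broken by lexicographic order.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def encode(word, merges):
    """Split a word into subword tokens by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    rules = train_bpe(["the"] * 10 + ["then"] * 2, num_merges=2)
    print(rules)                  # [('t', 'h'), ('th', 'e')]
    print(encode("then", rules))  # ['the', 'n']
```

A real tokenizer would additionally assign each resulting symbol an integer ID and typically operates on bytes rather than characters; both are left out here to keep the merge logic visible.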

Why token counts vary by language and content

English tokenizes most efficiently (about 0.75 words per token, or roughly 1.3 tokens per word on average) because English-language text dominates the training data tokenizers were optimized for. Non-English languages tokenize less efficiently: Chinese, Japanese, Korean, Hindi, Arabic, and most non-Latin scripts often need 2-4× as many tokens as English for the same semantic content. Code tokenizes inefficiently too because variable names, indentation, and punctuation all consume separate tokens. JSON is particularly token-heavy because of all the brackets and quotes. The practical impact: serving non-English users or working with code often costs 2-3× more per request than English text of similar length.
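
One contributing factor can be seen without any tokenizer at all, assuming a byte-level BPE tokenizer (one whose base alphabet is UTF-8 bytes, as in GPT-2-style tokenizers). Non-Latin scripts need more bytes per character, so they start further from a single token before any merges are applied. The helper name `utf8_bytes_per_char` is hypothetical, and the numbers below are byte counts, not token counts: actual token counts depend on the merges a given tokenizer has learned.

```python
def utf8_bytes_per_char(text):
    """Average number of UTF-8 bytes per character (code point) in `text`."""
    if not text:
        return 0.0
    return len(text.encode("utf-8")) / len(text)


if __name__ == "__main__":
    samples = {"English": "hello", "Chinese": "你好"}
    for language, sample in samples.items():
        # ASCII letters take 1 byte each; common CJK characters take 3.
        print(language, utf8_bytes_per_char(sample))
```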

Tokens, latency, and cost

Almost everything about LLM economics is per-token. Pricing: providers charge separately for input tokens (cheaper) and output tokens (more expensive, typically 2-5× the input price). Latency: time-to-first-token grows with input length (the model must process all input before generating); inter-token latency is roughly constant; total response time ≈ input tokens × prefill time per token + output tokens × generation time per token. Context windows: every token in the prompt counts against the limit. Optimization in production usually means reducing token count: tighter prompts, shorter system messages, prompt caching for repeated prefixes, and structured output to reduce unnecessary verbosity.
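
The latency relationship above can be turned into a back-of-the-envelope estimate by dividing each token count by a throughput in tokens per second. This is a minimal sketch: the function name and the throughput figures in the usage example are assumptions for illustration, not measurements of any provider, and real latency also includes network and queueing time.

```python
def estimate_latency_seconds(input_tokens, output_tokens,
                             prefill_tokens_per_s, decode_tokens_per_s):
    """Return (time_to_first_token, total_time) in seconds.

    Assumes prefill and decode each run at a constant throughput.
    """
    if prefill_tokens_per_s <= 0 or decode_tokens_per_s <= 0:
        raise ValueError("throughputs must be positive")
    # Prefill: the whole prompt is processed before the first output token.
    time_to_first_token = input_tokens / prefill_tokens_per_s
    # Decode: output tokens are generated one at a time.
    generation_time = output_tokens / decode_tokens_per_s
    return time_to_first_token, time_to_first_token + generation_time


if __name__ == "__main__":
    # Hypothetical throughputs: 5,000 tok/s prefill, 50 tok/s decode.
    ttft, total = estimate_latency_seconds(2000, 500, 5000.0, 50.0)
    print(f"TTFT ~{ttft:.1f}s, total ~{total:.1f}s")
```

With these assumed numbers, output length dominates total time, which is why capping response length is usually the first latency lever.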

Use cases

  • Estimating cost before sending a request (input tokens × input price + expected output tokens × output price)
  • Capping response length via max_tokens parameter to control cost and latency
  • Optimizing prompts to fit within context window limits
  • Comparing model pricing across providers on a per-token basis
  • Using prompt caching to amortize repeated prefix tokens, which some providers bill at discounts of up to about 90%
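
The cost formula in the first use case, extended with an optional cached-prefix discount, might look like the sketch below. The function name, parameter names, and all prices are hypothetical; the sketch also ignores any surcharge a provider may apply when a cache entry is first written.

```python
def estimate_cost_usd(input_tokens, output_tokens,
                      input_price_per_mtok, output_price_per_mtok,
                      cached_tokens=0, cache_discount=0.0):
    """Estimate request cost in USD from token counts and per-million-token prices.

    `cached_tokens` is the part of the input served from a prompt cache and
    billed at (1 - cache_discount) times the normal input price.
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    if not 0.0 <= cache_discount <= 1.0:
        raise ValueError("cache_discount must be between 0 and 1")
    uncached_tokens = input_tokens - cached_tokens
    billable_input = uncached_tokens + cached_tokens * (1.0 - cache_discount)
    input_cost = billable_input * input_price_per_mtok / 1_000_000
    output_cost = output_tokens * output_price_per_mtok / 1_000_000
    return input_cost + output_cost


if __name__ == "__main__":
    # Hypothetical prices: $3 per million input tokens, $15 per million output tokens.
    print(estimate_cost_usd(10_000, 1_000, 3.0, 15.0))
    # Same request with 8,000 prefix tokens cached at a 90% discount.
    print(estimate_cost_usd(10_000, 1_000, 3.0, 15.0,
                            cached_tokens=8_000, cache_discount=0.9))
```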

Examples in production

OpenAI tiktoken

OpenAI's open-source Python tokenizer library, which implements the same encodings used by GPT-3.5/GPT-4 and is used to count tokens before API calls.

Hugging Face Tokenizers

Fast Rust-backed tokenizer library supporting BPE, WordPiece, and Unigram models (including SentencePiece-style tokenizers): the tokenizer library behind most open-source models.

Anthropic

Claude API includes a count_tokens endpoint for exact token counting before sending requests, helpful for prompt budgeting.

Token compared to alternatives

| Alternative | Choose Token when | Choose alternative when |
|---|---|---|
| Characters (raw character count: what naive text length measures) | Use tokens for any LLM cost or context window calculation: they're what the model actually counts | Use character count only for display or basic length estimates |
| Words (whitespace-separated word count) | Use tokens for precise LLM-related sizing: about 1 token per 0.75 words for English | Use word count for human-readable length estimates |

Common pitfalls

  • Estimating tokens via 'characters / 4': works for English prose, breaks for code and non-English text
  • Forgetting that output tokens cost more than input tokens: long-form generation is expensive
  • Sending raw HTML, JSON, or XML when a more compact representation would work: markup is token-heavy
  • Not capping max_tokens: runaway responses inflate both cost and latency
  • Assuming all providers tokenize identically: they don't; the same text yields different token counts on GPT vs Claude vs Gemini

FAQ

Questions about Token.

How do I count tokens accurately?

Use the provider's official tokenizer: tiktoken for OpenAI, Anthropic's count_tokens API endpoint for Claude, or the Hugging Face tokenizer matching the open-source model you're using. Don't estimate via characters / 4 except as a rough sanity check: it's wrong for code and non-English text.
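
One way to follow this advice in code is to accept any tokenizer object that exposes an `encode` method and to fall back to the characters / 4 heuristic only when none is supplied. The names `rough_token_estimate` and `count_tokens` are hypothetical; the only assumption about the encoder is that `encode(text)` returns a sequence of token IDs, which holds for encodings returned by `tiktoken.get_encoding`.

```python
import math


def rough_token_estimate(text):
    """Heuristic only: about 4 characters per token for English prose."""
    return math.ceil(len(text) / 4)


def count_tokens(text, encoder=None):
    """Count tokens with a real tokenizer if given, else fall back to the heuristic.

    `encoder` is any object with an `encode(text)` method returning token IDs.
    """
    if encoder is None:
        return rough_token_estimate(text)
    return len(encoder.encode(text))


if __name__ == "__main__":
    print(count_tokens("hello world!"))  # heuristic fallback: 3
    # With tiktoken installed, an exact count for OpenAI models could use:
    #   import tiktoken
    #   count_tokens("hello world!", tiktoken.get_encoding("cl100k_base"))
```

Keeping the encoder as a parameter makes it explicit which provider's tokenizer a count refers to, since the same text yields different counts across providers.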

Why do non-English languages use more tokens?

Tokenizers are trained on corpora dominated by English. Non-Latin scripts, especially Chinese, Japanese, Korean, Hindi, and Arabic, often need 2-4× as many tokens as English for equivalent content. Some open-source models (mBERT, XLM-R) use multilingual tokenizers that even out costs across languages, but most frontier model tokenizers favor English.

Do input and output tokens cost the same?

No: output tokens are typically 2-5× more expensive than input tokens because generating each token requires a full model forward pass, while input tokens are processed in parallel during prefill. This pricing structure is common across major providers and is a key consideration when designing applications: a workflow that summarizes long input into short output is much cheaper per request than one that expands short input into long output.

What should I set max_tokens to?

For chat-style responses, 1024-2048 max_tokens is typical. For document generation, 4096-8192. Always set max_tokens explicitly: where the parameter is optional, omitting it can let the model generate up to its maximum output length, which can be expensive and slow.
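
A small guard can keep the chosen max_tokens consistent with the context window. The sketch below assumes that prompt and response share a single context window, which is common but worth checking per model; the function name, the optional `output_limit` parameter, and the window sizes in the usage example are hypothetical.

```python
def clamp_max_tokens(input_tokens, desired_max_tokens, context_window,
                     output_limit=None):
    """Return a max_tokens value that fits in what is left of the context window.

    Assumes input and output tokens count against the same `context_window`.
    `output_limit` is an optional separate cap on output length.
    """
    remaining = context_window - input_tokens
    if remaining <= 0:
        raise ValueError("prompt does not fit in the context window")
    cap = remaining if output_limit is None else min(remaining, output_limit)
    return min(desired_max_tokens, cap)


if __name__ == "__main__":
    # Hypothetical 8,192-token window with a 7,000-token prompt.
    print(clamp_max_tokens(7000, 2048, 8192))  # 1192
```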
